PREFACE


In this book I have assumed acquaintance on the part of the 
reader with current practice in mental testing. My dealing 
with this, therefore, is in the form of criticism rather than 
exposition. I have not tried to cover the ground which 
other, more competent writers have already covered. A 
short list of suggested general reading is appended, however, 
for the benefit of those readers who are not familiar with the 
work discussed especially in the middle third of this book. I 
have not attempted a complete bibliography. Specific refer- 
ences are made, in the form of footnotes, in the text. 

I should like to express my thanks to a great many friends 
and colleagues for their help in the preparation of this book. 
A number of kindly critics, both in England and in the 
United States, have read it or parts of it and have given me 
the benefit of their comments. I am particularly grateful to 
my colleagues in the Cambridge Psychological Laboratory 
with whom I discussed Chapter XII and who are largely
responsible for the more sensible suggestions therein.

This chapter, with those immediately preceding and 
succeeding it, provide a more positive statement of my views 
on intelligent activity and its appraisal than is found in the
others. For this reason some readers—especially those who
are unfamiliar with contemporary literature on intelligence 
tests—may do well to read these last chapters first. 

I also wish to express my gratitude to Professor 
P. E. Vernon for so kindly allowing me to quote at
length, in Chapter V, from his book The Structure of 
Human Abilities. 

CAMBRIDGE. 


A. W. H. 
January, 1954. 
Chapter One


INTRODUCTION 


The purpose of this introduction is three-fold: first to amplify 
the title of the book, secondly to give some indication of the 
types of reader for whom it is intended and thirdly to defend 
its terminology—in particular the frequent use of such 
cumbrous and ugly terms as ‘psychometrist’ and ‘test- 
retest consistency’. A last aim might perhaps be added: to 
justify the production of yet another book on intelligence. 

The word ‘appraisal’ was chosen by virtue of its overtones. 
It may be contrasted with measurement on the one hand and 
with subjective judgment on the other, although a few sym- 
pathetic vibrations of both of these may be apprehended. 
The use of ‘appraisal’ is intended to suggest probability 
rather than certainty and an overall rather than a narrow 
view. It is hoped to include also the possibility of unforeseen 
changes developing from the particular data available for 
any one individual at any one time. 

‘Intelligence’ was chosen in preference to some more 
general term since relatively little space will be devoted to 
discussion either of highly specific cognitive qualities or of 
temperament and character. By this I do not mean to 
postulate a clear-cut distinction between the intellectual and 
the emotional. On the contrary ‘intelligence’, as I understand
it, cannot be separated from other aspects of mental
activity. It is hoped that a reasonably clear connotation 
will emerge in the course of the chapters which follow. 

The book is intended for those members of the community 
whose work brings them into contact with psychological 
tests or with their results: school teachers and other educa- 
tionists, careers masters and other vocational counsellors, 
psychiatrists and psychiatric social workers, personnel
managers and employment officers. It is meant too for 
students who have completed at least one year of psychology 
or who are, in any case, already familiar with the essentials 
of mental testing. It is not intended in any sense as a general
introduction to psychological testing; in fact, a reader 
unacquainted with the relevant elementary works is likely 
to find much of this book confusing and disheartening. The 
loaded bookshelves of the proverbially intelligent layman 
should probably not be burdened with a copy. It is hoped, 
however, that some chapters may be of interest to the experi- 
mental psychologist and the psychometrist. 

The latter term will be used throughout to apply to those 
whose work is mainly or entirely concerned with the devising, 
standardizing and validating of psychological tests. The
connotation of ‘mental tester’ would be too narrow: that 
phrase might well apply to those who merely administer the 
tests—a function which is, perhaps regrettably, performed 
less and less by the devisers and interpreters of the tests. 
‘Psychologist’, on the other hand, would be too broad: it 
would imply an interest in mental behaviour generally, 
which is not always possessed by contemporary psychometrists.

Unwieldy phrases such as ‘test-retest consistency’ are used 
only when they are thought to lend clarity and precision, 
as in the chapter on ‘reliability’. Psychologists often invent 
and employ technical jargon which obscures their meaning. 
I have tried therefore to avoid lengthy and esoteric terms 
wherever possible. 

The Appraisal of Intelligence is not intended as a text-book. 
It is far from comprehensive, even within its limited field; 
it leaves many gaps in both subject-matter and references; 
in particular, the important statistical and physiological 
problems which impinge on certain problems of intelligence
are left virtually untouched. The book is very largely 
critical: it is easier to produce destructive criticism than 
constructive suggestions which will withstand the assaults of
time—and of the criticized. Moreover, several of the recom- 
mendations made will be found to belong already to the 
‘pre-scientific’ age of mental testing, the age when the indivi- 
dual subject held his own against theoretical constructs and 
when conflicting results could be neither ignored nor 
resolved by means of ingenious statistical devices. 

This regression to a previous era is not unconscious. It 
seems to me that the earlier investigators had something 
that contemporary psychometrists have lost. In fact it 
appears doubtful if the early work which was so impressive 
and fruitful would have been either, had it been initiated by 
its more rigid and arbitrarily-minded descendants. The 
latter, of course, do have an important contribution to make 
to the theory and practice of mental assessment but they err, 
in my view, in thinking to supplant the older methods rather 
than to develop them. 

The tendency of the last few decades has been towards 
elaborate statistical techniques combined, at some later 
stage, with over-simplified and sometimes irrelevant
psychological interpretations; towards emphasis on mathe- 
matical exactitude and objectivity as though these have 
intrinsic value for the psychologist, even if statistical signi- 
ficance is achieved only at the price of psychological 
significance; towards innumerable controversies over minor
technical points which have effectively detracted attention 
from the major psychological problems. 

It may be that this development is a necessary stage in the 
evolution of the art of mental assessment. But it is to be 
hoped that it is only a transitional stage and that the time 
is approaching when the recent advances in method will be
recognized as merely useful means to an end. If this book 
effects a step in that direction, it will serve its purpose. 




Chapter Two 


THE MEANING OF INTELLIGENCE 


Psychologists have been generous to a fault with their 
definitions of intelligence. Almost every writer on the 
subject has put forward his own definition and some, in the 
fullness of time, have even offered more than one—and have 
not always been constrained by considerations of com- 
patibility. A detailed discussion of all these definitions 
would constitute in itself a long and somewhat tedious 
book; and a search for a highest common factor, if we ignore 
the tacit agreement to differ, would prove as fruitless as 
exhausting. It is true that some of the apparent disagree- 
ments are mainly verbal but many of them reflect funda- 
mental differences of opinion concerning the concept defined. 

Even the classification of definitions is controversial. We 
might contrast the a priori with the empirical, but where then 
should Spearman be placed, for instance? He gives his three 
armchair noegenetic principles¹ on one page of The Nature
of Intelligence and the Principles of Cognition and on another 
suggests that g is merely a convenient way of expressing the 
fact that performances on various types of cognitive tasks 
tend to intercorrelate—than which nothing could be more 
empirical. 

We might contrast the biological with the psychometric, 
considering first definitions which centre round adaptability 
and survival, emphasizing the importance of adjustment 
to the environment, and second, the type of definition which 
at its crudest and frankest defines intelligence as ‘that which 
is assessed by intelligence tests’.²

¹ See page 14 of this book.
² See, e.g., Boring, E. G., ‘Intelligence as the Tests Test It’ (1923),
New Republic, 35, pp. 35-7.
On the other hand, we might prefer to contrast those
definitions which lay stress on the theoretical and academic 
sorts of intelligence with those which make practical success 
their criterion. A good example of the first type is Terman’s 
capacity ‘to carry on abstract thinking’. The second would 
include Woodrow’s oft-quoted ‘capacity to acquire capacity’, 
since he elaborates his aphorism as follows: ‘Capacity for 
such mental activity as is most effective in bringing about 
success or . . . capacity for success in so far as success is 
dependent upon mental processes, either past or present.’ 
(‘Success’ he defines, a little later, in terms of ‘success’.) 

Little is to be gained from a Procrustean attempt to force 
existing definitions into any one of these classifications. I 
shall content myself with discussing briefly a number of 
suggested definitions and, more fully, a few representative 
ones, noteworthy on account of their longevity, their 
ingeniousness or the eminence of their author. 

In 1921, the Editor of the Journal of Educational Psychology
invited seventeen psychologists to take part in a symposium.
Contributors were asked to write brief answers to the follow- 
ing two questions: 
‘1. What I conceive “intelligence” to be, and by what
means it can best be measured by group tests. (For example,
should the material call into play analytical and higher
thought processes? Or, should it deal equally or more
considerably with simple, associative and perceptual
processes, etc.?)

‘2. What are the most crucial “next steps” in research?’


Of the seventeen psychologists approached, fourteen 
responded. Very roughly, these can be divided into four 
groups: 

(a) Those who laid the main stress on learning. For 
example, Buckingham wrote of intelligence as ‘ability to 
learn . . . measured by the extent to which learning has
taken place or may take place’ and Dearborn wrote that
intelligence tests should measure ‘actual learning rather
than . . . the results of learning’. He stated also that mental
age depends on at least three factors: (i) native intelligence,
(ii) physiological maturity, (iii) environment—which is
unexceptionable but leaves (i) undefined.

(b) Adjustment or adaptability in some form was a fairly 
popular choice. Colvin, for instance, combined something
of (a) with (b) by defining an individual as intelligent ‘in
so far as he has learned, or can learn to adjust himself to his
environment’. Pintner wrote of adaptability to new situations;
he added that intelligence is used in dealing with
things and people as well as words and symbols. Somewhat
similarly Peterson wrote of intelligence as ‘a mechanism
for adjustment and control . . . operated by internal as
well as by external stimuli’. Both of these emphasized
the need, in their view, for different kinds of intelligence
tests.

(c) The contributors in this group gave their definitions in 
the form of enumerated qualities. Thus, for Haggerty, intelli- 
gence is active, innate, qualitative and quantitative, com- 
plex. Thurstone’s list was more complicated: ‘(i) the 
capacity to inhibit an instinctive adjustment, (ii) the capacity
to re-define the inhibited instinctive adjustment in the
light of imaginally experienced trial and error, (iii) the
volitional capacity to realize the modified instinctive adjustment
into overt behaviour to the advantage of the individual
as a social animal.’

Group (d) consists of those few cautious contributors who
declined to commit themselves, professing either lack of
interest, as did Pressey, or lack of data, as did Ruml. The
latter, however, did go so far as to suggest that every test
should be kept as ‘pure’ as possible. In this he differed from
most of his fellow symposiasts and foreshadowed the trend of 
contemporary psychometrists to ‘purify’ their tests to the
utmost.

Points which many, but by no means all, contributors 
had in common were a belief that intelligence is highly
complex and that tests of intelligence should be very
mixed; and, associated perhaps with this, a keen desire
to see further research into tests of character and of
temperament, to supplement the more cognitive types of
tests. This was a very frequent reply to the editor’s second
question.

The symposiasts were both more and less ambitious than 
their successors—and than themselves when of more mature 
years. On the one hand, some of them attempted to face 
such problems as the relations of originality and of imaginativeness
to intelligence. On the other hand, their standards
of precision and of objectivity were for the most part lower
than are usually demanded today.

I shall end this brief survey of the 1921 Symposium with 
two excerpts, one from Terman and one from Woodrow. 
They both serve as sidelights on the progress and the inertia 
of the last thirty years of intelligence testing. 

Terman elaborated his dictum that ‘an individual is 
intelligent in proportion as he is able to carry on abstract 
thinking’ by stating that the ‘essential difference, therefore, 
[between the moron and the intellectual genius] is in the 
capacity to form concepts to relate in diverse ways, and to
grasp their significance’. He goes on as follows: ‘One may,
of course, question our grounds for designating any kind of
mental activity as “higher” or “lower” than another. Why,
it may be asked, should certain types of mental processes be
singled out for special worship? In fact, it is frequently
intimated that the individual who flounders in abstractions
but is able to handle tools skilfully, or play a good game of 
baseball, is not to be considered necessarily as less intelligent 
than the individual who can solve mathematical equations, 
acquire a huge vocabulary, or write poetry. The implica-
tion is that the two individuals differ merely in having 
different kinds of intelligence, neither of which is higher nor 
better than the other. It is difficult to argue with anyone 
whose sense of psychological values is disturbed to this 
extent’! (Present writer’s exclamation mark.) It may be 
worth noting, however, that some years later, in his book 
with Merrill on Measuring Intelligence, Terman writes of ‘the 
behavioural composite which we call intelligence’, and 
comments on the unevenness of the manifestation of intelligence
in the individual.

The excerpt from Woodrow is far shorter. After having 
discussed ‘capacity’ and ‘success’ at some length he observes 
that the ‘best tests . . . will be tests of the simpler mental 
functions in young or unintelligent children and tests of 
the more complex functions in older or more intelligent 
ones’.

The last point would sound too obvious to need saying or
repeating, were it not for an increasing tendency to use
certain tests as all-purpose tests, regardless of the level and
the composition of the group tested. Recognition of some of
the inevitable consequences of this (particularly those which
affect ‘validity’) may be inferred from the current practice
of employing various statistical corrections ‘for homogeneity’
and ‘for attenuation’: inappropriateness of test for group
may well produce a decrease in range of score or in test-retest
consistency. Whether or not these devices are
statistically legitimate, they leave untouched the psychological
effects of using inappropriate tests (when the group is
predetermined), or inappropriate subjects (when the test is
predetermined). These are examined below (Chapters IX
and X).¹

¹ See also Heim and Batts: ‘Upward and Downward Selection in
Intelligence Testing’ (1948), Brit. J. Psychol., XXXIX, 1.

Three further definitions of intelligence will be discussed
in some detail: those of Thorndike (who edited and also took
part in the Symposium); of Spearman and of Binet.

Thorndike wrote in 1921: ‘Realizing that definitions and 
distinctions are pragmatic, we may then define intellect in 
general as the power of good responses from the point of view of 
truth or fact.’ The problem of estimating these is as old as the
oldest philosophies. Moreover, if we accept Thorndike’s
definition, even without fully understanding it, how are we 
to accommodate the intelligent swindler or forger, for 
instance, who successfully denies the truth and distorts the 
facts? 

A few years later, in his book on The Measurement of 
Intelligence, Thorndike wrote rather differently. It is 
peculiarly difficult to summarize his views therein on the 
meaning of ‘intellect’ or ‘intelligence’ since he develops the 
theme in the course of the book and he attributes now a 
psychological, now an almost physiological, and now a 
purely empirical meaning to the term.

He begins with what he calls a ‘first approximation’. 
‘Let intellect be defined,’ he says, ‘as that quality of mind (or 
brain or behaviour if one prefers) in respect to which Aris- 
totle, Plato, Thucydides, and the like, differed most from 
Athenian idiots of their day, or in respect to which the 
lawyers, physicians, scientists, scholars, and editors of reputed
greatest ability at constant age, say a dozen of each, differ 
most from idiots of that age in our asylums.’ 

This differentiation sounds perhaps so extreme as to be 
obvious and virtually useless. In two respects, however, it
is very important. First, it contains the germ of an external
criterion—the need for which contemporary writers tend to
neglect. Thorndike clearly recognized the importance and
the difficulty of finding such a criterion. Evidently, for him
validation would have consisted in testing subjects whose
general level of intelligence was already agreed on—this
agreement having been reached without reference to mental
tests of any kind—and then comparing the test scores of the
pre-established superior subjects with those of the pre- 
established inferior ones. This is surely a more meaningful 
‘validity’ than that which depends, circular fashion, on 
internal consistency or a high degree of saturation or a close 
association with some existing ‘validated’ test. 
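
The comparison of pre-established groups described above may be sketched, very roughly, as follows. The function name, the scoring of ties and the figures used to try it out are assumptions made for illustration only; nothing of the kind appears in Thorndike.

```python
from itertools import product

def separation(superior_scores, inferior_scores):
    """Fraction of (superior, inferior) pairs in which the subject already
    agreed to be superior obtains the higher test score.
    Ties count as half (an arbitrary convention for this sketch)."""
    pairs = list(product(superior_scores, inferior_scores))
    wins = sum(1.0 if s > i else 0.5 if s == i else 0.0 for s, i in pairs)
    return wins / len(pairs)

# Invented scores: the groups are fixed beforehand, without reference to any test.
print(separation([10, 12], [5, 6]))
print(separation([5, 7], [6, 6]))
```

A value near 1 would suggest that the test separates the two groups in the expected direction; a value near one half would suggest that it scarcely distinguishes them at all.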

Secondly, Thorndike postulated his conditions with great 
care: he specified ‘Athenian idiots of their day’ and ‘of reputed 
greatest ability at constant age’. When we compare this 
broad and cautious ‘first approximation’ with modern 
‘scientific’ methods in which the intelligence of subjects of
all ages (not excluding eminent historical figures) is 
expressed in terms of intelligence quotients, and when we 
observe the rise or fall of a ‘nation’s intelligence’ being de- 
bated in similar terms, regardless of changing fashions in 
education and in mental testing procedure, we may gain 
some idea of the swiftness with which an ever-narrowing 
road has been travelled during the intervening years. 

Thorndike goes on to distinguish between altitude (or 
level) of intellect, extent (or area) of intellect and speed of 
intellect. He says ‘For rigorous measurements . . . it seems 
desirable to treat these three factors separately, and to know 
the exact amount of weight given to each when we combine 
them.’ This is less straightforward. By ‘level’, Thorndike 
means the degree of difficulty of a task, estimated by the 
number in a given group who complete it successfully, 
given unlimited time; by ‘area’, he means the number of 
different tasks of equal difficulty, performed. Although this 
is related to ‘level’ it is not identical, since individuals differ
as to what they find difficult: they therefore vary consider- 
ably in their ability to do tasks which theoretically they 
should be able to do equally well—the tasks having been 
equated for difficulty, as defined above. Thorndike states, 
however, that level and area of intellect are correlated and 
that either one is an indicator of the other. 
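
The definitions of ‘level’ and ‘area’ just given may be made concrete in a short sketch. The pass and fail records, the function names and the use of the proportion failing as the index of difficulty are all assumptions made for the example; they are not taken from Thorndike.

```python
# Hypothetical records: results[person][task] is True if the person
# completed the task, given unlimited time.
results = {
    "A": {"t1": True,  "t2": True,  "t3": False, "t4": False},
    "B": {"t1": True,  "t2": False, "t3": True,  "t4": False},
    "C": {"t1": True,  "t2": True,  "t3": True,  "t4": False},
    "D": {"t1": False, "t2": False, "t3": False, "t4": True},
}

def difficulty(task):
    """Proportion of the group failing the task (higher means harder)."""
    fails = sum(1 for person in results.values() if not person[task])
    return fails / len(results)

def area(person, level):
    """Number of tasks at the given difficulty that the person completes."""
    return sum(
        1 for task, passed in results[person].items()
        if passed and difficulty(task) == level
    )
```

With these invented figures, tasks t2 and t3 are of equal difficulty (0.5), and A and B each complete one task at that level, but not the same one. Their ‘areas’ are therefore equal although their performances differ in kind.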
This problem points the difficulty of attempting to treat
intelligence as a measurable, one-dimensional entity. Once
it is accepted that A can do Tasks T 1, 2 and 3 but fails on
Tasks S 1 and 2, whilst B successfully completes S 1, 2, 3 and
4 but cannot begin to tackle T 1, some sort of compromise
becomes imperative. We can choose between (i) asserting
that T (or S) tasks demand real intelligence whereas S (or
T) demand merely specific quality s (or specific quality t);
(ii) combining T tasks with S tasks to form a test, concen- 
trating our doubts on the relative weighting; (iii) wondering 
whether perhaps ‘pure intelligence’ which admits of no 
qualitative differences is not a poor servant and an intoler- 
able master. 
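
Alternative (ii) may be illustrated by a small sketch. The numbers of tasks passed follow the example in the text (A passes three T tasks and no S tasks; B passes four S tasks and no T tasks); the weights and the function names are arbitrary assumptions, chosen only to show that the ordering of A and B depends on them.

```python
# Number of tasks passed, following the example in the text.
passed = {"A": {"T": 3, "S": 0}, "B": {"T": 0, "S": 4}}

def composite(person, weight_t, weight_s):
    """Weighted sum of T and S tasks passed (a hypothetical scoring rule)."""
    scores = passed[person]
    return weight_t * scores["T"] + weight_s * scores["S"]

def ranking(weight_t, weight_s):
    """Names ordered from highest to lowest composite score."""
    return sorted(
        passed,
        key=lambda p: composite(p, weight_t, weight_s),
        reverse=True,
    )
```

With equal weights B ranks first (4 against 3); doubling the weight given to T tasks reverses the order (6 against 4). The doubt is thus not removed by the compromise but merely transferred to the choice of weights.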

The fact of confusion becomes obvious in the formulation
of Thorndike’s third theorem: ‘Other things being equal, if
intellect A can do at each level the same number of tasks as 
intellect B, but in less time, intellect A is better. To avoid 
any appearance of assuming that speed is commensurate 
with level or with extent, we may replace “better” by 
“quicker”.’ Thus, Thorndike, unwilling to assume that speed
of work is necessarily associated with superior quality of 
intellect and aware that this is unverified if not unverifiable, 
is driven to postulate as a theorem that if A can do the same 
as B but in a shorter time, A is quicker. 

In my view, ‘level’, ‘area’ and ‘speed’ are inextricably
intermingled and this throws some doubt on the simple 
concept of ‘difficulty’, defined in terms of group perfor- 
mance, which Thorndike and his successors have adopted. 
More recent controversies are usually expressed in terms of
‘power versus speed’: although the terminology has altered,
the over-simplification of the concepts remains. I shall 
return to this question in Chapters IX and XII. 

There remain two major points in Thorndike’s exposition 
which I designated above as ‘near-physiological’ and ‘purely 
empirical’. The first of these is his ‘hypothesis that quality
of intellect depends upon quantity of connections’. By
‘connections’ he means possible association of ideas, or 
images or concepts; but he wishes also to include ‘whatever 
anatomical and physiological fact corresponds to the possi- 
bility of forming one connection or association or bond be- 
tween an idea or any part or aspect or feature thereof and a 
sequent idea or movement or any part or aspect or feature 
thereof’. 

He distinguishes between correct and incorrect connec- 
tions and between original capacity and acquired knowledge. 
But, as he recapitulates, ‘the gist of our doctrine is that, by 
original nature, the intellect capable of the highest reason- 
ing and adaptability differs from the intellect of an imbecile 
only in the capacity for having more connections of the sort 
described’. 

As with most physiological theories of intelligence, it is 
logically possible; it takes little account of qualitatively 
different mental processes; it becomes obscure where it 
should be most clear (‘whatever . . . fact corresponds to the 
possibility of . . .’); it is, at the present stage of brain 
physiology, unprovable and irrefutable; and it bears little 
or no relation to the theory and practice of assessing mental 
capacity. 

It is probably for his practical contribution to this assess- 
ment that Thorndike is best remembered within this field 
and it is this that I had in mind when designating one of his 
approaches as ‘purely empirical’: I refer to Thorndike’s 
‘Intellect CAVD’ and his ‘CAVD Tests’. The symbol 
CAVD refers to four types of test, Completion, Arithmetic, 
Vocabulary and Directions. 

Writing in 1925, Thorndike believed that ‘very substantial
agreement’ would be found among ‘competent psychologists’
as to the products and tasks which ‘depend primarily
upon intellect’. For practical purposes in the testing of
intellect, he was satisfied with CAVD, with its sections of 
ever-increasing difficulty. He added, however, that ‘intellect
may be CAVD, or CAVD plus ability in giving the opposites 
of words, making it CAVDO; or that, plus insight into 
spatial relations, making it CAVDOS; or that, plus ability 
in inductive and deductive reasonings, making it CAVDOSR,
and so on.’ 

This is clearly an over-simplified and unsatisfying state- 
ment of the case. It would imply, for instance, that recog- 
nizing synonyms (as is required in Thorndike’s Vocabulary 
test) is a mental process necessarily different in kind from 
that of finding opposites; that reasoning does not enter 
appreciably into Completions or Arithmetical problems; 
that obeying instructions exactly plays a role in determining 
test score only when the test is named Directions; and so on. 

It is true that similar distinctions and names are frequent 
in contemporary writings but such classifications today rely 
on statistical techniques for their justification: they make
no claim to the cautious and psychological manner in which 
Thorndike explored the possibilities of a useful battery of 
intelligence tests. His contribution was valuable, owing 
largely to his raising so many of the complex problems 
inseparable from any theory of intelligence. He was con- 
stantly challenging and readjusting his ideas by setting them 
against the beliefs of common sense and the terms of 
common usage. 


The case of Spearman is very different. He sometimes 
(though not often) challenged common sense and common 
terminology to compete with his theories—to the occasional 
discomfiture of the first two. In order to gain some understanding
of Spearman’s theory of intelligence and intelligence
tests, we must first examine his three famous ‘noegenetic’
Principles of Cognition as propounded in his Nature of
Intelligence and Principles of Cognition. 
1. Apprehension of Experience. ‘Any lived experience tends
to evoke immediately a knowing of its characters and 
experiencer.’ 

2. Eduction of Relations. ‘The mentally presenting of any
two or more characters (simple or complex) tends to evoke 
immediately a knowing of relations between them.’ 

3. Eduction of Correlates. ‘The presenting of any character 
together with any relation tends to evoke immediately a 
knowing of the correlative character.’ 


Spearman coined the term ‘noegenetic’ to apply to his 
Principles, in order to emphasize his belief that (a) they are 
all self-evident and (b) they generate items of cognitive 
content. ‘They, and they alone,’ he writes ‘are generative
of new items in the field of cognition.’ 

I shall briefly discuss these three principles since, accord- 
ing to Spearman, the word ‘intelligence’ ‘covers all three 
noegenetic principles in every one of their manifestations’ 
and his frequent references to noegenetic principles and 
processes (even in his later works)² bear witness to the great
importance which he attaches to them. 

The first principle ‘Apprehension of Experience’ appears
to be a statement of the fact of introspection. As stated, it 
might be inferred that Spearman holds that explicit aware- 
ness of oneself as experiencer, and of the nature of one’s 
experience, is always present, save in very exceptional 
circumstances. However, he later goes out of his way to
stress the word ‘tends’ in his statement of the principles and
to affirm his belief that the experience and the knowing of
it are two separate processes.

¹ Spearman, C. E., Abilities of Man (1927), London: Macmillan.
² Spearman, C. E. and Wynn Jones, Ll., Human Ability (1950), London:
Macmillan.

Introspective and retrospective evidence support the
necessity for the latter distinction. If, however, we accept
this distinction and the strong emphasis on ‘tends’—which
suggests that it designates ‘often’ rather than ‘nearly always’ 
—we are left with something like this: Many people when 
living through an experience are often aware, at the time, 
of themselves as experiencers and of the kind of experience 
through which they are living. 

It would be irrelevant to discuss here the evident exclusion 
of the lower animals and, perhaps, of young children from 
Spearman’s ‘any lived experience’; the uniformly passive 
role assigned to the ‘liver’; and the extent to which the 
awareness (whether of self or of experience) is explicit. As 
the principle now stands it is innocuous: that some people 
at some times have the postulated double awareness cannot 
be denied. Still, the connection between this fact of intro- 
spection and the other two principles is obscure, as is the 
connection between it and degree of intelligence, however 
defined. The infrequency with which Spearman later refers
to his first principle, either by name or by implication,
suggests that it stands in a somewhat step-brotherly
relation to the other two. 

The remaining two principles have a great deal in 
common. They both constitute examples of types of deduc- 
tive reasoning—types which prove sufficiently similar to 
create confusion among some of the illustrations given by 
Spearman.

The second principle, Eduction of Relations, is quite 
straightforward, considered in isolation. Faced with a white
patch and a coloured patch, we may immediately apprehend
the relation different; presented with the words ‘loud’ and
‘soft’, we may educe the relation opposite; hearing the musical
notes C and E flat, we may educe the relation different or,
perhaps, higher or, possibly, minor third. These illustrations
are simple and relatively uncontroversial. However,
Spearman offers as one of his illustrations of his second
principle the typical analogy found in intelligence tests,
whether verbal or diagrammatic: he gives, for instance,
‘Warmth is to Stove as Sharpness is to Fireplace, Tool, Heat,
Cut’. This seems to me rather an example of his third 
principle, Eduction of Correlates. 

In order to transform it into a problem of educing relations,
or into a pair of such problems, it would have to be expressed
in some such form as the following: ‘Warmth : Stove—what
relation?’ and ‘Sharpness : Tool—what relation?’ And
the response would be some fairly complex relationship such
as ‘essential function’ or—since sharpness is scarcely a 
function—‘attribute possessed by’. The fact that the original 
analogy is not particularly cogent renders difficult the 
finding of one relation wholly appropriate to both pairs. 

Certain types of problem fall naturally into the form of 
Spearman’s second principle and others fall naturally into 
the form of the third principle. It is probable, however, 
that with suitable rewording the two can always be 
interchanged. 

Non-controversial examples of Eduction of Correlates 
would include the following: ‘Given the colour orange and 
the relation complementary, we educe the correlate blue; given 
the word loud and the relation opposite, we educe the 
correlate soft; given the musical note C and the relation 
minor third above, we educe the correlate E flat.’

For both eductive principles, the type of illustration and 
the degree of complexity can be varied indefinitely. The 
fundaments between which the relations subsist may be 
more than two, as in ‘A is greater than B and B is greater 
than C’, and in the analogy given above; the fundaments 
may themselves be relations, as in a comparison between 
two musical chords, for instance; and so on. 

In addition to his statement that the word ‘intelligence’ 
covers the three noegenetic principles in all of their mani- 
festations, Spearman often implies that the three principles 
in all their manifestations cover ‘intelligence’. However, we
have not yet a full picture of all that he signifies by the word
in his many writings. In order to gain some understanding 
of this we must consider his coining and use of the term g; 
his ‘theory of two factors’; his views on general ‘energy’ or 
‘power’, and specific ‘engines’—to retain his terminology.

Spearman observes that some individuals tend to do well
in a number of widely differing tasks whilst others, on the
contrary, do more or less uniformly badly. As he puts it:
‘Such [positive] correlations . . . exist when even the form
of the operation is no longer the same but widely unlike.’ 
In view of the general tendency of one individual to have, 
roughly, the same degree of success whatever task (or, 
rather, psychologist-devised test) he attempts and another 
to have, equally consistently, a different degree of success,
Spearman postulates a general factor of which individuals 
possess differing amounts. Since the tasks may vary 
considerably, one from another, and the correlation is in 
fact far from perfect, he further postulates a specific factor, 
peculiar to each type of task. He considers his hypothesis to 
be a psycho-physical one and uses throughout the physical
analogues of ‘energy’ for his general factor and ‘engines’ for 
his specific factors. 

The role of these analogues is more important than 
might appear at first sight, since Spearman drifts impercep- 
tibly into a more and more literal understanding of them, 
especially of ‘energy’: it is clear that there are early limits to 
employment of the word ‘engines’ at all literally, in such a 
context. On the other hand, careful re-reading of Spearman 
reveals his wisdom in invoking the aid of these analogies 
since many of his reflections would prove inexpressible, or
highly unlikely, in non-figurative terms. 

The ‘theory of two factors’ is then a theory of one general 
factor and an unspecified number of specific factors. His 
general factor Spearman elects to call g since, as he writes: 
‘There emerges the concept of a hypothetical general and
purely quantitative factor underlying all cognitive perfor-
mances of any kind. Such a factor as this can scarcely be
given the title of “intelligence” at all; being evoked to explain
the correlations that exist between even the most diverse
sorts of cognitive performance, it does not deserve a name
appropriate to any one particular sort.’

Thus, at that stage (1923), g was merely a short-hand way 
of expressing the fact that psychological tests of a cognitive 
kind are liable to correlate positively with one another. 
Later, for Spearman and his followers, ‘proofs’ of ‘the 
existence of g’ consisted of this tendency towards positive 
intercorrelations and the frequent finding of hierarchical 
orders or zero tetrad differences, in the matrices of inter- 
correlations. The intercorrelations can of course always be 
given a helping hand upwards, as those tests ‘with low g’
become eliminated from genuinely diverse test batteries.
Evidently for Spearman the fact that the hierarchies and the 
zero tetrad differences are necessary to the truth of the Two 
Factor Theory renders them sufficient to prove it. 
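
The criterion of zero tetrad differences can be illustrated numerically. The following is a minimal sketch, not Spearman's own procedure: it assumes a hypothetical battery of four tests, and the loadings, the second matrix and the function name tetrad_differences are all invented for illustration. If one general factor alone accounts for the intercorrelations, each correlation is the product of two loadings and every tetrad difference vanishes; a matrix with two separate clusters of tests gives tetrad differences well away from zero.

```python
from itertools import combinations

def tetrad_differences(r):
    """Return the three tetrad differences for every set of four tests.

    r is a symmetric correlation matrix given as a list of lists.
    """
    out = {}
    for a, b, c, d in combinations(range(len(r)), 4):
        out[(a, b, c, d)] = (
            r[a][b] * r[c][d] - r[a][c] * r[b][d],
            r[a][c] * r[b][d] - r[a][d] * r[b][c],
            r[a][b] * r[c][d] - r[a][d] * r[b][c],
        )
    return out

def largest_tetrad(r):
    """Largest absolute tetrad difference found in the matrix."""
    return max(abs(t) for ts in tetrad_differences(r).values() for t in ts)

# Hypothetical loadings of four tests on a single general factor (assumed).
loadings = [0.9, 0.8, 0.7, 0.6]
n = len(loadings)
one_factor = [
    [1.0 if i == j else loadings[i] * loadings[j] for j in range(n)]
    for i in range(n)
]

# Hypothetical matrix in which tests 0-1 and tests 2-3 form separate clusters.
two_clusters = [
    [1.0, 0.7, 0.2, 0.2],
    [0.7, 1.0, 0.2, 0.2],
    [0.2, 0.2, 1.0, 0.7],
    [0.2, 0.2, 0.7, 1.0],
]

print("one factor:  ", round(largest_tetrad(one_factor), 3))
print("two clusters:", round(largest_tetrad(two_clusters), 3))
```

The sketch shows only the direction that is not in dispute: a single general factor implies vanishing tetrads. It does nothing to establish the converse, which is the step questioned in the paragraph above.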

Prima facie, there is a great deal to be said for calling that 
which tests test, or that which causes tests to intercorrelate, 
by the name of arbitrarily chosen letters of the alphabet, 
such as g, k, m, etc. It should constitute a confession of 
ignorance, indicating that we do not really know what the 
test assesses, if anything, apart from degree of ability to do 
the test and that we have no presuppositions as to the nature 
of the cognitive process involved nor as to what constitutes 
‘validity’ for the tests concerned. But the advantages of 
using arbitrary symbols were lost almost as soon as Spear- 
man set the fashion. He was himself the first sufferer since 
he sometimes slipped, insensibly, into equating g with 
intellect or intelligence, as understood by him—and even as 
generally understood. 

It is now common practice for the nursery school teacher, 
the man in the street, the personnel manager, the colonel
and the employment officer to refer to people as having
‘very high g’ or ‘very low g’ (rarely, ‘average g’), without the 
slightest knowledge or interest in any psychological tests 
which the individuals may or may not have taken. Using 
the term g in this way, we get the worst of both worlds: 
subjective judgments are lent a spuriously scientific air, and 
the original caution and precision are lost—the loss being 
disguised by continued wearing of the modest garb! And 
when influential psychometrists adopt the same habit the 
effects are still more catastrophic. 

It may be that the spurious science and the extravagant 
claims are inevitable while psychometry in its present form 
continues to enjoy its present vogue. In that event, it would 
seem the lesser evil frankly to call psychological methods of 
assessment ‘tests of intelligence’, ‘tests of spatial perception’, 
etc., than to call them ‘g tests’ and ‘k tests’, since in this way 
the self-deception is more apparent. 


Binet is discussed third although his early work preceded 
Thorndike’s and Spearman’s by some years. He is one of 
the most advanced writers in the field, in the sense that the 
most valuable ideas and methods in contemporary psycho- 
metry result largely from his work and that few, if any, of his 
successors have approached him in psychological under-
standing and invention, and balanced judgment. The fact
that he wrote in French may partly account for his clarity
and vividness of expression. He had no particular axe 
to grind and was thus willing to remain flexible in his 
approach throughout his work on intelligence. His claims 
were less ambitious and better justified than those of his
successors.

Binet’s approach was at first purely empirical. From the 
1890's, he had been experimenting with various tests. His 
now famous work on intelligence was stimulated originally
by a request from the Paris Municipal Education Authori-
ties in the early 1900’s for help in recognizing the mental
defectives and the very dull children, who could gain no 
benefit from standard education and who should therefore 
be sent to special schools. It was suspected that certain 
children in these schools were backward only as a result of 
poor educational grounding, or early illness, or some 
specific inadequacy—in reading, for instance. On the other 
hand, some children in the ordinary schools proved in- 
capable of tackling even the simplest tasks. 

Between 1905 and 1911, Binet and Simon published five 
articles in L’Année Psychologique which were later published in
book form under the title The Development of Intelligence in
Children. In these articles, the writers describe the confusion 
then reigning in regard to the recognition and classification 
of mental defects; they tidy up the nomenclature and suggest 
methods of diagnosis. This includes the devising and the 
validating of series of tests, against an external criterion and, 
in the course of this work, they formulate what appear to
them to be the most useful hypotheses concerning the nature
of intelligence.

To remove some one statement from its context and label 
it ‘Binet’s definition of intelligence’ would be misleading.

Binet was too wise to assign any cut and dried meaning to 
the term and he was at pains to point out the many different 
nuances and aspects of the concept. It is possible, however,
to gain some understanding of the essence of intelligence for 
Binet, as his writings, his experiments, and his final series of 
tests all go to form a harmonious and self-consistent whole. 
The following is perhaps his nearest approach to a defini- 
tion: ‘It seems to us that in intelligence there is a funda- 
mental faculty, the alteration or lack of which, is of the 
utmost importance for practical life. This faculty is called 
judgment, otherwise called good sense, practical sense, 
initiative, the faculty of adapting one’s self to circumstances.
To judge well, to comprehend well, to reason well, these are
the essential activities of intelligence.’

But, it may be argued, all these are different things: how 
can intelligence be judgment and comprehension and 
reasoning and practical sense and initiative. . . ? If intelli- 
gence is measurable and if individuals are to be compared 
with one another in respect of the amount they possess, it 
cannot be this untidy mixture of unprecise elements: it must 
and shall be purified. This argument expresses crudely and 
briefly the treatment meted out to ‘intelligence’ during the 
past three or four decades.¹

Binet, however, managed to include in his series of tests a 
great many of the qualities he suggested. The main 
difference between him and other psychometrists is that he 
started with a practical problem, thought about it and made
tentative hypotheses as to the best measuring instruments—
and proceeded to verify these by returning to the practical
problem. He then altered his tests so that the data they
yielded accorded better with the facts. He did not at any
stage allow himself to be forced into distorting or omitting
results which did not tally with his theories since these
were flexible throughout—both in the sense that they
had a healthy vagueness and momentum and that he
expressed them as questions and suggestions rather than
dogma.

A few examples may serve to make this clear. It will be 
remembered that Binet’s tests include such tasks as com-
pleting or copying a drawing; repeating a number of digits,
either as heard or in reverse order; threading beads in a
given pattern; choosing the prettier of two faces; arithme-
tical problems, social problems, absurdities, vocabulary; and
many other types of task. Thus Binet’s tests and series of 


1 For emphasis on ‘purity’, see e.g. Guilford, J. P. and Lacey, J. I.,
Printed Classification Tests on p. 849, and also much of the writing of
Thurstone, L. L., and of Kelley, T. L.

tests are far from ‘pure’ and the ‘impurity’ is not fortuitous.
It springs from his belief in the all-embracing nature of 
intelligence, and his endeavour to do justice to children 
from different homes and different schools, from town and 
from country. 

The variety of medium and of topics has the additional 
advantage of capturing the child’s interest and usually 
holding it throughout the test series. The variety in itself 
lends interest and, moreover, particular predilections are 
more likely to be catered for. The child who dislikes draw- 
ing may very well enjoy stringing beads; and the one who 
finds both these activities tedious may welcome ‘verbal 
absurdities’ as new and amusing. 

The contemporary psychometrist reasons: We cannot test 
everything that the layman calls intelligent behaviour so 
we shall narrow it down as much as possible and use the 
word in our own technical sense. We shall not call the 
assessed quality ‘intelligence’ unless it is innate, stable and 
objectively measurable, i.e. we must design our tests to 
minimize the effects of environment, eliminate any which 
fail to show high internal consistency and ignore any quality 
which is not amenable to the multiple choice of answer 
technique. 

Binet, on the other hand, observing that intelligence 
manifests itself in a multitude of ways and that children vary 
as to what they find difficult, argued: We must construct 
tests as varied as possible and see whether we can make our 
use of the word ‘intelligence’ square with that of common 
usage. We see that it is as fruitless to attempt to distinguish 
between the effects of heredity and environment in practice 
as it is in theory; we see also that to banish all thought of 
originality and creativeness from our concept of intelligence 
is unjustified. Therefore, we must devise tests some of which 
depend largely on previous training and experience for their 
successful completion, as far as possible restricting the
required experience to that which is common to all subjects
in the normal course of events; and our tests should not all
be of the multiple choice variety, since this form of response
successfully eliminates phenomenological differences in
originality and initiative.

Thus we find that Binet introduced variety not only into 
the medium and the subject-matter of his tests but also into 
the form of answer required. Whenever practicable, he 
required the child to produce his own answer. ‘In what way 
are an apple and orange alike?’ ‘What is the difference 
between wood and glass?’ ‘Give two reasons why most 
people would rather have a motor-car than a bicycle.’ In
all questions of making comparisons and of giving reasons,
the initiative and invention of the child are brought into
play. The context is less narrow, the rules less set, the con-
ventions less arbitrary than in group paper and pencil tests.

An original answer is possible and is not necessarily
penalized. Again, the inclusion of several problems involv-
ing a social element ensures more flexible interpretations
than are allowed in most contemporary tests of intelligence. 

The planned omission from these latter tests of any problem
with a concrete or social or aesthetic flavour results in an
extremely narrow connotation for the term ‘intelligence’—
one which Binet would have deplored. 

Perhaps as a result of his technique of testing, he was led
to recognize the existence of degrees of rightness—and degrees
of wrongness. For example: ‘The skin of an orange is yellow
but you eat an apple raw’ is a ‘poor’ wrong answer to the
ast of the questions quoted above. On the other hand: 
‘You can’t fall off a motor-car’ is a ‘good’ if unorthodox part-
reply to our third example. Binet was well aware of the
importance of uniformity in marking and he did not omit to
give guidance on the matter: he laid down, for instance, that
a definition which classifies the object named is superior to a
definition ‘by use’ alone; that an ‘absurd’ answer earns still
less credit than one which, though incorrect, is nevertheless
within the right context. 

Thus giving and marking Binet’s tests is less rule of thumb 
than is the administering of most recently devised tests. 
Whether this be considered an asset or a drawback will 
depend on one’s general attitude to mental testing and 
testers. Certain it is that the intrusion of judgment in the 
giving of Binet’s tests is closely related to his views on the 
technique of testing. 

He strongly emphasizes that giving the tests and evaluating 
the responses is a highly skilled job; that the tester requires 
understanding and intuition in addition to a rigorous train- 
ing and considerable experience. He points out the dangers 
of a tester who intimidates or antagonizes the child or who, 
on the contrary, gives hints unwittingly of the kind of 
response that would be acceptable. He describes the teacher- 
tester who takes advantage of the test situation to impart a 
little information here and there or who indicates, when the 
child has made his response, whether it was right or wrong. 
He describes too the behavioural clues of which the percep- 
tive tester may avail himself when determining the test score. 

These practical points may sound too obvious to be worth 
mentioning after nearly fifty years of mental testing, but 
their importance has grown with every decade. Binet was 
not, of course, the only mental tester to point out the necessity 
of considering the-whole-child-in-the-total-test-situation. 
Many of the earlier experimenters realized this and stressed 
it in their writing.¹ But this point of view has received less
and less attention as tests have steadily become more mass- 
produced, in every sense. 

One further point of major difference between Binet and 
his successors of the 1930’s and 1940’s is that of test valida- 
tion. Binet was concerned to find some exact, external 


1 See, for instance, Ballard’s Group Tests of Intelligence, and Burt’s
Mental and Scholastic Tests, both of which appeared in the early 1920’s.

criterion with which to compare his test findings. As is well
known, he observed and made use of the fact of mental 
development. Having noted that a normal child of six, 
for instance, can do things which a normal child of five 
cannot achieve, and that the six-year-old tends to fail on 
tasks which the normal seven-year-old can manage, he took 
‘success at age x’ as his criterion. 

The significance of this criterion, however, appears to be 
less well known, since Binet’s successors have either con- 
tinued to use it (with the necessary statistical adjustment) for 
subjects over sixteen years of age or have departed from it 
in favour of internal criteria, based on such data as item 
analysis or inter-test correlations. The former procedure 
makes nonsense of Binet’s very valuable contribution— 
with its necessary assumption of full development at fifteen 
or sixteen (an assumption based on results of tests originally 
devised for children or adolescents). The latter procedure 
ignores the lesson of Binet that the criterion of intelligence 
must be entirely independent of the tests which are intended 
to assess it.

It was Binet who gave us the concept ‘mental age’. The 
concept of intelligence quotient (I.Q.), ratio of mental age 
to chronological age, derived from his work and will be 
discussed in Chapter IV. Binet found, naturally, a con- 
siderable overlap between the tests at all ages: a child of 
nine, for instance, may fail on one of the tests in year eight 
and yet pass two of the tests for year ten and one for year 
eleven. Binet did not on this finding necessarily banish such 
test items from his scale, in order to increase its ‘reliability’. 
Such results tallied with what he knew from everyday life 
of the way difficulty varies with the individual, with interest, 
with education and with general environment; they tallied 
with his concept of intelligence. 
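
The arithmetic of the quotient can be set out in a few lines. The following is a minimal sketch under stated assumptions: the multiplication by 100 is the customary convention rather than anything laid down in the passage above, the ages are invented, and the function name ratio_iq and its adult_ceiling parameter are hypothetical. The ceiling stands in, very crudely, for the practice of treating development as complete at fifteen or sixteen when the quotient is applied to adults.

```python
def ratio_iq(mental_age, chronological_age, adult_ceiling=None):
    """Ratio I.Q.: mental age divided by chronological age, times 100.

    adult_ceiling, if given, caps the chronological age used as divisor.
    It is an illustrative stand-in for the adjustment made for adults.
    """
    if chronological_age <= 0:
        raise ValueError("chronological age must be positive")
    divisor = chronological_age
    if adult_ceiling is not None:
        divisor = min(chronological_age, adult_ceiling)
    return 100.0 * mental_age / divisor

# A child of eight who passes the tests typical of age ten.
print(ratio_iq(10, 8))                      # 125.0

# An adult of forty with a mental age of sixteen, divisor uncapped.
print(ratio_iq(16, 40))                     # 40.0

# The same adult, with the divisor capped at sixteen.
print(ratio_iq(16, 40, adult_ceiling=16))   # 100.0
```

The second and third results show how much depends on the assumed age of full development: the same performance yields a quotient of 40 or of 100 according to the divisor chosen.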

He was, however, constantly altering and adding to his 
tests in the light of experience. He was well pleased with the
results of his experimental testing over the course of years.
In fact, he expressed delight and surprise at their efficacy.
But he wrote: ‘In the course of our explanation we have 
insisted on the character of our method of measuring. Not- 
withstanding appearances it is not an automatic method 
comparable to a weighing machine in a railroad station, on 
which we need but stand in order that the machine throw 
out the weight printed on the ticket.’ 


This point needs as much stressing now as it did when first 
written.


SOME CURRENT DEFINITIONS OF INTELLIGENCE


Contemporary statements as to the meaning of ‘intelligence’ 
are less varied than they were when psychometry’s chrono- 
logical age was tender. The majority of active workers in 
the field have accepted the approach of factor analysis in 
one or other of its forms, and their controversies, though 
frequent and animated, concern for the most part minor 
differences in interpretation arising from differences in 
statistical techniques and are resolved, if at all, by the same 
methods. It is significant that in a recently published 
psychological book of very general interest,¹ the chapter
entitled ‘Intelligence’ (and contributed by a non-factorist)
devotes thirty-one of its forty-one pages to ‘the factorial
solutions’.

I propose to consider the factorial point of view in Chapter 
V and to limit myself here to a brief discussion of two con- 
temporary definitions—that of J. C. Raven and of my own. 
It is perhaps not justified to criticize Raven’s definition of 
‘intelligence’ in so far as it is not strictly speaking his own. 
He tells us² that ‘according to the Oxford Dictionary,
“intelligence” may mean either “understanding as a
quality admitting of degree” or “a piece of information”’.
Raven goes on to say that ‘The two meanings of the word
are equally important. In order to act intelligently in any
situation, a person needs both the necessary information
and the capacity to form comparisons and reason by 
analogy.’ 

The saltus from ‘understanding as a quality admitting of 


1 Helson, H., Theoretical Foundations of Psychology (1951), D. Van 
Nostrand Co. 

2 Raven, J. C., ‘The Comparative Assessment of Intellectual Ability’
(1948), Brit. J. Psychol., XXXIX, 1.

degree’ to ‘capacity to form comparisons and reason by
analogy’ is rather sudden. However, to do well on the 
Progressive Matrices test¹ the subject clearly needs this
capacity and to gain a good score on Raven’s Mill Hill
Vocabulary Scale² equally clearly requires the subject to
possess certain ‘information’. The former is a carefully 
graded diagrammatic test comprising sixty items, each 
consisting of a matrix of eight figures requiring a ninth to 
complete it, the solution to be selected from among six (or 
eight) proffered figures. The latter is the usual type of 
vocabulary test, consisting of a list of words to be defined or 
explained, graded from short, easy words at the beginning 
to long, difficult words at the end. 

Raven links his theory with his practice, observing that 
‘For unequivocable inferences to be drawn from the results 
of mental tests, the two components of intelligent conduct 
must be considered separately’ and maintaining that these 
two tests together ‘cover as far as possible the whole range 
of intellectual development from the age of five onwards.’³

I should agree with Raven that probably the two tests 
jointly yield a more satisfactory estimate of a subject’s 
general mental ability than would either test alone. On
several points, however, he seems to me to mistake or to
over-simplify.


(a) The two tests can scarcely ‘cover the whole range’ 
seeing that they embrace such a small number of logical and 
psychological principles. Even if we confine ourselves within 
the framework of mental tests, for instance, no problems of a 
numerical or arithmetical nature are included; no verbal 
problems other than defining set words; no visual problems 
1 Raven, J. C., ‘A New Series of Perceptual Tests’

2 Raven, J. C., Mill Hill Vocabulary Scale (1943), London:

3 Raven, op. cit. (1948).

of a pictorial or representational kind; and even the dia-
grammatic problems are all presented in identical form.
Raven may have had this latter point in mind when he 
stated that Progressive Matrices ‘is not a test of general 
intelligence, and it is always a mistake to describe it as such’.¹
This does not, however, always tally with his other pro-
nouncements.

(b) I find it hard to accept that any vocabulary scale is, 
as Raven suggests, ‘a test of the general fund of information 
a person has acquired compared to other people, as the 
result of intellectual activity in the past’. Granted that it is 
difficult (though not necessarily impossible) to assess some- 
one’s ‘fund of information’ without words, it does not follow 
that a reliable cross section of his vocabulary will provide 
a valid assessment of all the information he possesses. Much 
of what he ‘knows’ may be in the form of non-verbal skills, 
for instance, which even a verbal-minded subject might find 
difficult to communicate in words. 

(c) Raven’s acceptance as a psychologist of the dictionary 
definition of intelligence seems to me to go both too far and 
not far enough. On the one hand, the ‘piece of information’ 
is surely referring to intelligence in a second sense (as, for 
example, ‘charm’ may signify either ‘quality or feature 
exciting love or admiration’ or ‘amulet, trinket or watch 
chain, etc.’). On the other hand, intelligent human
behaviour is usually thought to extend beyond ‘understand- 
ing as a quality admitting of degree’ even if this be supple- 
mented by acquired knowledge. 


I should like to suggest that intelligent behaviour may 
admit of differences in kind as well as in degree. For me, 
intelligent activity consists in grasping the essentials in a
given situation and responding appropriately to them. This 
can scarcely claim to be a definition and it is certainly not 

1 Raven, op. cit. (1948).

precise. I prefer ‘intelligent activity’ to ‘intelligence’ because
I wish to avoid the suggestion of some one trait which an 
individual simply possesses, to a greater or lesser extent. I 
use the word ‘activity’ in its widest sense, not necessarily 
implying activity which is overt at the time. 

My connotation is intentionally vague because I should 
like it to cover the unprecise meaning understood by the 
man in the street—the increasingly rare individual who has 
not yet been introduced to ‘g’ and to ‘I.Q.’—and by the 
novelist and biographer; to apply to that which most ‘tests 
of intelligence’ do in fact assess; and to be consistent with a 
biological approach. Furthermore, I hope to indicate the 
intrinsic flexibility of intelligent behaviour, its variation 
with different individuals and different times and the extent 
to which ‘the most intelligent activity’ in any given situation 
may be controversial. 

The psychologist may choose among three courses: (i) to 
use the word ‘intelligence’ in a way which is far more 
restricted than, and has little in common with, the layman’s 
use of the word; (ii) to dispense with the term altogether 
and adopt some other symbol (such as ‘g’ or ‘brightness’) 
to refer to his own particular meaning; (iii) to retain the 
word in his vocabulary, attributing to it a meaning which is 
comfortably compatible with that of the layman. 

The first course seems indefensible: to take a popular and 
relatively unambiguous word, denude and disguise it and 
try to restrict its circulation, is to invite misuse the moment 
it makes its way back to the outside world—which it is 
bound to do sooner or later. 

The second course would be unexceptionable—would 
indeed possess many advantages—if the man in the street 
and the man in the laboratory knew their respective places 
and kept to them. However, they are apt to meet, say in 


1 Cf. Boring, E. G., ‘Intelligence as the Tests Test It’ (1923), New
Republic, 35, pp. 35-7.

the public house, with the result that, after some little time
lag, both get the worst of both worlds. As suggested in the 
last chapter, the exact, narrow meaning of the new symbol 
becomes blurred; it progresses farther and farther in the 
direction of the old, rejected term; yet it masquerades as an 
exact, and in this case a measurable, concept. 

For these reasons the third course has been adopted. I 
hope that my suggestion may be acceptable to the layman, 
the psychometrist and the biologist. Its vagueness is an 
advantage—indeed a necessity. It invites the questions: 
What are ‘essentials’ and ‘appropriate responses’? Do not 
both of these depend on the particular circumstances? 
What of the individual who grasps the essentials but fails to 
respond, or responds inappropriately? How do we know 
whether the individual who responds appropriately has 
grasped the essentials? Are essentials and inessentials 
fundamentally different? Surely this definition does not 
indicate what would in fact be the most intelligent behaviour 
in such and such a situation? 

These points are well justified but they are not necessarily
indictments. That they arise is due to my belief in the 
flexible and the variable nature of intelligent behaviour, 
especially if the word is to be applied to situations outside 
the mental testing room. Intelligent activity has no set 
rules. What is essential depends on the context and on the
individual who finds himself in it. What is appropriate 
behaviour for a tall, heavily built, slow-moving army officer 
candidate in a W.O.S.B. ‘stress’ situation may well be 
inappropriate for his small lithe competitor. Individual A, 
who is thought to have grasped the essentials in some problem 
situation which involves him, but does not respond in one or 
other of the appropriate ways is in fact behaving less intelli- 
gently than individual B who not only sees how to solve the 
problem but actually does so. Certainly, it may be that A 
was uninterested or emotionally disturbed and, therefore,
failed to do himself justice: his behaviour in that situation was,
for whatever reason, less intelligent than that of B.

The behaviouristic second half of my definition is indis- 
pensable if the term ‘intelligence’ is to apply, among other 
things, to that which intelligence tests assess. Nor is it 
peculiar to mental testers: it is common when judging some- 
one’s intelligence to be guided by his observable behaviour. 
Perhaps the only frequent exception to this practice is in 
assessing one’s own intelligence! 

My suggestion implies no presuppositions as to the unitary, 
and strictly measurable, nature of intelligence nor as to the 
existence or non-existence of qualitative differences. It 
would, for instance, be in accordance with it to include such 
concepts as ‘social intelligence’ or ‘creative intelligence’, 
since in certain situations the essentials to be grasped and the 
appropriate behaviour are predominantly of a social or 
creative kind.

Finally, it is intended to suggest that intelligent behaviour 
is not easily distinguished from instinctive behaviour; that, 
in so far as it is distinguishable, the difference is of degree 
rather than kind; and, it follows, that intelligence is not the 
exclusive possession of man. The implications of this view 
are discussed more fully in Chapters XI, XII and XIII. 


Chapter Four 


THE CONCEPTS OF MENTAL AGE AND INTELLIGENCE 
QUOTIENT 


A review of the material collected by testers of intelligence 
suggests a number of underlying presuppositions, some of 
which I should like to criticize. These presuppositions have 
sometimes been publicly discussed and discredited, yet most 
of the work on intelligence testing continues implicitly to 
assume their truth. First is the evident belief that there 
exists some one attribute ‘intelligence’, which is one- 
dimensional, measurable, capable of quantitative but not 
qualitative differences and, therefore, capable of strictly 
quantitative comparison. Secondly, the belief may be found 
that this attribute is diagnosable as a potentiality (not merely 
as observably manifested): that the individual whose high 
degree of intelligence is at present potential only, is identifi- 
able by means of tests. 

A further assumption, important to the theory of intelli- 
gence testing, is that intelligence is normally distributed 
throughout the total population. This, however, differs in 
several respects from the assumptions considered above. 
In the first place it is explicitly recognized—though not 
perhaps as an assumption: the normal distribution of intelli- 
gence (excluding mental defectives of certain grades) is often 
hailed as a scientific discovery—despite common knowledge 
that frequency distributions on any test depend mainly on 
the particular system of scoring adopted and that a compe- 
tent mathematician can, should he so desire, reduce almost 
any distribution to normality. 
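The claim that almost any distribution can be reduced to normality is easily illustrated. What follows is a minimal sketch only: the rank-based transformation and the function name `normalize_by_rank` are chosen here for illustration (they are one of several possible devices, not a procedure taken from any particular test manual), and the skewed scores are invented.

```python
from statistics import NormalDist


def normalize_by_rank(scores):
    """Replace each score by the normal deviate matching its rank.

    Tied scores share the mean of the ranks they occupy.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of scores tied with position i.
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    # (rank - 0.5) / n keeps every proportion strictly between 0 and 1.
    return [NormalDist().inv_cdf((r - 0.5) / n) for r in ranks]


# Invented, heavily skewed crude scores.
skewed = [1, 1, 2, 2, 3, 4, 6, 9, 15, 40]
z = normalize_by_rank(skewed)
```

The transformed values depend only on rank order, so the same near-normal shape would emerge whatever the crude scores had been; the normality is a property of the scoring, not of the subjects.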

It is of course an asset to have normally distributed 
scores, whatever the quality to be assessed: many of the
statistical techniques used in manipulating test scores require
distributions which are roughly symmetrical and at least 
approach normality. For this reason alone it is often 
justifiable to adopt a scoring system or to impose a time limit, 
to arrange the questions or to select the groups to be tested in 
such a way that the scores do form a near-normal distribution. 
These procedures are permissible within limits: in practice it 
is, after all, a question of weighing the merits of one arbitrary 
decision against another. It should be borne in mind, however,
that the greater the departure from the original ‘crude’
scores, the greater will be the degree of arbitrariness and the
more obscure what exactly is being done.

However, it is not permissible in my view to infer from 
test results, as an interesting and instructive truth, that 
intelligence is normally distributed. Indeed it is doubtful 
whether such results throw much light on mental phenomena 
—apart, perhaps, from those of the test-devisers and inter- 
preters. The analogy between the normal distribution of 
height, for instance, and that of intelligence does not hold, 
since when measuring height there is relatively little doubt 
either as to what is being assessed or as to the significance 
and defensibility of the units of measurement. When 
‘measuring’ intelligence, however, these theoretical diffi- 
culties present themselves in addition to innumerable 
practical ones: Test 1, which yields a near-normal distri- 
bution with a random sample of the population of A, yields 
a highly skewed distribution when applied to an equally 
random sample of the people of B; Test 2, equally well 
validated, produces a most unhealthy-looking distribution 
when given to the people of A; but Test 2, administered without
a time limit to a different random sample of B, yields a
symmetrical (though somewhat flattened) curve; and so on.

The concept of Intelligence Quotient which I propose to 
discuss is related to the ‘fact’ of normal distribution. Since,
however, the I.Q. was originally the ratio of mental age to
chronological age, I must first briefly consider these two,
separately. ‘Chronological age’ presents no difficulty. It is
faintly reminiscent of the expression ‘shell egg’: in a sane 
world, it might be argued, age that was unchronological and 
shell-less eggs would have a Learful or Carrollian significance 
only. In any case, in countries where every newborn baby 
automatically acquires a birth certificate, it is seldom that 
any difficulty is experienced in determining the chronological 
age of a subject. 

Mental age is less simple. In the case of a child who solves 
all the problems, for instance, for Year 8 (and also Year 7) 
but who fails on all the problems for Years 9, 10 and 11, it is
quite straightforward: he has, by definition, a mental age of 
8. Such a simple cut-off is of course rarely found. There is 
usually a considerable overlap between the various age 
levels, and there is some difference of opinion among 
psychometrists as to the most satisfactory way of computing 
the odd extra ‘months of mental age’ and of interpreting 
marked anomalies in test performance. 

Binet took as his criterion, success in passing all but one of 
the problems of a given Year; he then added on so many 
months of mental age for each problem of an older year 
group which was successfully solved. This was not a very 
exact measure, since he did not have the same number of 
problems in every year group and he had not standardized 
the test to get similar distributions of scores at all ages. 
However, the revisers of the Binet-Simon Test Scale, notably 
Burt,1 and Terman and Merrill,2 have attended to these
points. They have standardized the test on large represen- 
tative groups; and they present a uniform number of prob- 
lems (six) at each age level. Thus the amount to add on for 
each separate problem correctly solved is always two months. 
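The arithmetic of the revised scale can be sketched as follows. This is an illustration under stated assumptions, not the scoring rule of any published manual: the function name `mental_age_months` is invented, and the basal level is taken (as an assumption) to be the highest year at which every problem is passed, all lower tested years also being fully passed.

```python
def mental_age_months(passes_by_year, items_per_year=6):
    """Return mental age in months.

    passes_by_year maps a year level to the number of problems passed
    at that level, e.g. {7: 6, 8: 6, 9: 4}.
    """
    months_per_item = 12 // items_per_year  # six problems -> two months each
    basal = None
    for year in sorted(passes_by_year):
        if passes_by_year[year] == items_per_year:
            basal = year
        else:
            break  # first incompletely passed level ends the basal run
    if basal is None:
        raise ValueError("no year level was passed completely")
    # Credit every problem passed above the basal level.
    extra_items = sum(n for y, n in passes_by_year.items() if y > basal)
    return basal * 12 + extra_items * months_per_item


# The simple cut-off described earlier: all of Years 7 and 8, nothing above.
clean_cut = {7: 6, 8: 6, 9: 0, 10: 0, 11: 0}
# The commoner case of overlap between age levels (invented figures).
overlap = {7: 6, 8: 6, 9: 4, 10: 2, 11: 1}

print(mental_age_months(clean_cut))  # 96 months, i.e. 8 years
print(mental_age_months(overlap))    # 110 months, i.e. 9 years 2 months
```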

1 Burt, C., Mental and Scholastic Tests (1927), London: P. S. King and
Son.

2 Terman, L. M. and Merrill, M. A., Measuring Intelligence (1937),
G. G. Harrap.

The question as to what proportion of children should solve
the appropriate set of test problems is complicated and
controversial. What proportion of children between 4 and 
5 (or 4½ and 5½), for example, should successfully complete
all the 5-year level tests? The answer has been variously
given as 50, 55, 66, 75 and 100 per cent.1 The least arbitrary
of these figures seems to be 50 per cent. On this criterion 
as many of the group will pass the 5-year level tests as will 
fail them: the median mental age for a random sample of 
5-year-olds will be 5 years. In this event, however, the mean 
mental age for such a random sample will not necessarily 
be 5 years. 
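The distinction between median and mean here is a matter of simple arithmetic. In the sketch below the nine mental ages are invented purely for illustration; they are so chosen that half the group falls on either side of 5 years while a few high scorers pull the mean upwards.

```python
from statistics import mean, median

# Invented mental ages, in months, for nine children aged exactly 5 years.
mental_ages = [48, 54, 57, 60, 60, 63, 66, 78, 90]

median_years = median(mental_ages) / 12  # 60 months -> 5 years
mean_years = mean(mental_ages) / 12      # 64 months -> 5 years 4 months
```

A 50 per cent criterion fixes the median only; the mean remains at the mercy of the shape of the distribution.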

A further difficulty confronting the psychometrist who 
expresses his results in terms of mental age is that of rate of 
development. Apart from individual variation in this 
respect (which has tended to be ignored or minimized in the 
past) it is well known that very young children develop at a
faster rate than older children. Terman and Merrill in their
Revision have allowed for this by arranging the sets of
problems for the 2-year-olds to 5-year-olds in 6-monthly 
instead of yearly intervals with suitable adjustments to the 
scoring. At the adult end of the scale, the system of scoring 
is adjusted to allow for expansion when testing adult subjects 
or highly intelligent adolescents: it is thus theoretically
possible to gain a score equivalent to a mental age of 22 years
10 months!

[...] (or 15 or even 14) years of age; therefore, chronological age
[...]ge may continue to develop
up to 22 years 10 months.


1 Burt, op. cit. (1927), pp. 138-41.

Apart from the statistical and logical objections to this
procedure (which cannot be satisfactorily defended on
grounds of convenience and convention), there are serious 
psychological objections to the belief that mental develop- 
ment stops at about 15 or 16 years—or even that the kind of 
mental ability assessed by intelligence tests stops as early as 
that. It is only recently that tests have been devised which 
provide a challenge to intelligent 15-20-year-olds; and on 
the present evidence it looks quite likely that the scores of 
subjects within those age groups may continue to rise, though 
the rise would naturally be gentler than that found with 
younger subjects. 

The types of problem used in the Terman-Merrill Revi- 
sion are unsuitable for late adolescents and adults with 
regard to both interest and level of aspiration.1 (Binet, of 
course, never intended to apply his tests to subjects over 
fourteen or fifteen years of age.) There is some little evidence
that subjects tend in general to ‘rise to the occasion’—but to
rise so far and no farther: that is, unless their challenge is
increased they are unlikely to extend themselves fully.2
This may well be one of the reasons for the failure until 
recently to find any increase in test scores after about 
fifteen years. To say that lack of scope does not enter until 
subjects are gaining 100 per cent on the tasks assigned to 
them is not a convincing answer. 

Finally the term ‘mental age’ itself is questionable even 
when applied to the 2-14 age range. It suggests some esti- 
mate of maturity and we have all met the bright but 
childish 10-year-old of whom it means something to say that 
he is ‘intelligent but young for his age’, no less than the dull 
child who yet manages to behave in a fairly adult manner. 
This latter is often the case with a physically well-developed
child, whose mature social behaviour derives from the fact
that people treat him as older than companions of his own
age; and that children—perhaps to a still greater extent
than adults—tend to become as they are treated.

1 Lewin, K., et al., Chapter on ‘Level of Aspiration’ in Personality and
the Behaviour Disorders, ed. J. McV. Hunt (1944). New York: Ronald
Press Co.

2 Cane, V. R. and Heim, A. W., ‘The Effects of Repeated Retesting:
III’ (1950), Quart. J. exp. Psychol., II, 4.

Binet himself was well aware of this sort of difficulty. He 
raised the question, for instance, whether an intelligent 
9-year-old with a computed mental age of 11½ is the mental
equivalent of a dull 13-year-old or of an average child of 11½
—and answered in the negative. His successors, less clinically- 
minded, have also been less cautious. They have not only 
accepted the concept of mental age in a more rigid way than 
Binet could have ever approved: they have pushed it 
considerably farther and, in the process of translating it into 
terms of I.Q. and extending its application to adults, they 
have rendered an originally useful idea almost meaningless. 
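Binet's three children make a convenient worked example of the quotient. The sketch below assumes the conventional form of the ratio I.Q., mental age divided by chronological age and multiplied by 100; the multiplier and the function name `ratio_iq` are supplied here and are not given in the text above.

```python
def ratio_iq(mental_age_years, chronological_age_years):
    """Ratio I.Q.: 100 x mental age / chronological age (conventional form)."""
    return 100 * mental_age_years / chronological_age_years


bright_nine = ratio_iq(11.5, 9)         # about 128
dull_thirteen = ratio_iq(11.5, 13)      # about 88
average_child = ratio_iq(11.5, 11.5)    # exactly 100
```

All three children share one mental age, yet the quotient separates them by some forty points. The figure therefore says nothing about whether they are mental equivalents, which is the question Binet answered in the negative.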

It is true that some contemporary mental testers such as 
Wechsler discard mental age, as outlined above, and cal- 
culate the I.Q. of the subject, whether adult or child, by 
comparing his test score with the expected mean score of 
members of his age group.1 This method is not directly 
open to the objections to mental age which I have raised. It 
seems to me a pity, however, to persist in the use of the term
I.Q., partly because this still signifies for many an extension
of mental age and partly because there are, I think, a number
of objections which apply specifically to the concept of I.Q.
and to its implications and uses.
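The alternative method, in which a score is compared with the expected mean of the subject's own age group, may be sketched as below. Two things are assumed beyond what the text states: that the comparison is made in units of the age group's standard deviation, and that the resulting scale has a mean of 100 and a standard deviation of 15. The function name and the norms are hypothetical.

```python
def deviation_iq(score, group_mean, group_sd, scale_sd=15.0):
    """I.Q. as standing relative to one's own age group (assumed form)."""
    z = (score - group_mean) / group_sd
    return 100.0 + scale_sd * z


# The same crude score of 62 judged against two invented age-group norms.
against_younger_group = deviation_iq(62, group_mean=50, group_sd=8)
against_older_group = deviation_iq(62, group_mean=58, group_sd=8)
```

No mental age enters the calculation at any point, which is why the term I.Q. sits uneasily on the result.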


Its advantages ar 
It is very convenient to h 


1 (a) Wechsler, D., Measurement [...]; (b) Wechsler, D., Intelligence Scale for Children (1949),
Psychol. Corp.

absolute significance both in the sense of applying with equal
cogency to all members of the population and of remaining, 
within narrow limits, constant for the individual through- 
out his life—with the consequent implication that his 
environment will have little effect on his I.Q.; an expression 
into which all other means of assessing mental capacity can 
be (and often are) translated. 

It is not hard to find explicit disclaimers of many of these 
‘advantages’. In fact, most psychometrists can point to 
isolated paragraphs in their writings which question one or 
other of them. However, in general, a thoroughgoing 
behaviourist, observing the writers and their readers, would 
be justified in assuming complete acceptance on all sides. 
It therefore seems worth our while to discuss these attributes 
of the I.Q. in some detail. 

Let us begin with the question of the alleged constancy of 
the I.Q. Everyone would agree that it is constant only 
within certain limits but the position of the limits is contro- 
versial. The figures given by psychometrists vary with their 
degree of ego-involvement. It is often assumed in practice 
that the I.Q. of the normal individual remains sufficiently 
unchanged by time, by practice and by experience generally
to justify one testing only. It is sometimes pointed out that
the greatest discrepancies tend to occur towards the two 
extremes—with the near defectives and the outstandingly 
bright children: this is true for both statistical and psycho- 
logical reasons. Rarely, if ever, are the questions of 
differential rates of development and of practice effects 
openly and fully discussed. For example, Burt in his review
of The Trend of Scottish Intelligence1 devotes three short
paragraphs to the effects of familiarity with intelligence tests and
of coaching in intelligence tests. In the middle one of these
three occurs the sentence: ‘In those areas where intelligence
testing is a routine procedure, the 1947 pupils showed a rise
of 3.2 points over the score obtained by their predecessors in
1932; on the other hand, pupils who had been subjected to
no such tests, at least during the preceding school session,
showed a rise of only 0.4 points.’ It is regrettable that in
many of the discussions of the question of the trend of
intelligence, this point has not been mentioned.

1 Burt, C. (1950), Brit. J. educ. Psychol., XX, 1, pp. 55-61.

The question of the constancy of the I.Q. is of course much 
more than a question of the effects of practice, differential 
or otherwise, although it may be worth noting how often 
these effects have been dismissed as negligible. Equally 
important is the question of inconsistency within the indivi- 
dual—inconsistency which may as easily result in deteriora- 
tion of performance as in improvement. Certain crises, such 
as illness or the birth of a sibling or the onset of puberty may 
well accelerate or retard mental development so that a 
subject’s position one year with respect to his fellows may, 
correctly, be assessed very differently from his position the 
preceding year;1 whilst removing from country to town, for
example, may produce a startling leap in a child’s I.Q.

My next criticism is of the ostensible universality of
application of the I.Q. Psychometrists unhesitatingly express
the intelligence of an individual in terms of I.Q., whatever
his age (although they may allow in scoring for the fact that
‘intelligence declines’ after about thirty), whenever he
lived2 and however his mental capacity was assessed in the
first place. Scores on a test which has not even a nodding
acquaintance with mental age are translated into I.Q. and
thereby made apparently comparable one with another.
The facts that the standard deviation of the Binet Test
ranges from 12 to 17 according to its exponent,3 that it
always varies with age groups,1 that the two populations
may not be random samples of the same universe and that the
levels of the tests may not be similar, are often ignored.

1 Honzik, M. P., MacFarlane, J. W. and Allen, L., ‘Stability of
Mental Test Performance between Two and Eighteen Years’ (1948),
J. exp. Educ.

2 Cox, C. M., Genetic Studies of Genius, vol. 1 (1926), Stanford U. P.

3 Penrose, L. S., The Biology of Mental Defect (1949), p. 25. Sidgwick
and Jackson.
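The practical weight of a standard deviation that ranges from 12 to 17 can be shown numerically. The sketch below adopts, purely for the sake of illustration, the normal curve with mean 100 which the testers themselves assume; that assumption is theirs and is not endorsed here, and the function name `percent_below` is invented.

```python
from statistics import NormalDist


def percent_below(iq, sd, mean=100.0):
    """Percentage of an assumed normal population scoring below `iq`."""
    return 100 * NormalDist(mean, sd).cdf(iq)


for sd in (12, 17):
    print(f"S.D. {sd}: I.Q. 130 exceeds {percent_below(130, sd):.1f} per cent")
```

On the smaller standard deviation an I.Q. of 130 is reached by fewer than one subject in a hundred; on the larger, by nearly four in a hundred. The same figure thus carries quite different meanings according to its exponent.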

Related to these questions is the difficulty of getting 
random samples of the population on which to establish I.Q. 
levels in the first place. The original groups are not always 
truly representative of the total population. Yet once a set 
of norms has been established by any one tester, others are 
liable to base their validation of their test on his findings and 
cumulative errors are likely to flourish unrecognized. The 
difficulty of getting random samples is always present in an 
attempt to establish norms of any kind. But the necessity 
for having random samples, and the implication that they 
have been attained, is greater when test results are expressed 
in terms of I.Q. than in terms of, for instance, percentile 
status. (See end of this chapter.) 

Of recent years it has been realized that a test which 
satisfactorily classifies the members of an unselected group 
according to their mental capacity is unlikely to discriminate 
at all comparably between members of a highly selected 
group (whether selected ‘upwards’ or ‘downwards’).2 One
result of this recognition has been the devising of certain
high-grade tests, such as Matrices 1947, AH 5 and many
CISSB tests, which are intended for application to subjects
of very high intelligence.3 These tests yield a reasonably
wide range when given to appropriate groups—possibly a 
wider range than when given to random samples—and they 
certainly produce a very different, more nearly normal, 
frequency distribution from that yielded by a highly
intelligent group on a test devised for unselected subjects.
In view of these findings it is clearly not permissible to
translate scores from such tests into terms of I.Q. Apart from
the statistical objections, little that is relevant is known
about subjects with an I.Q. of 150+; there has been a
tendency to label them ‘genius’ and to leave it at that.

1 Tizard, J., ‘The Abilities of Adolescent and Adult High Grade
Mental Defectives’ (1950), J. ment. Sci., xcvi, No. 405, p. 897.
2 Heim, A. W. and Batts, V., ‘Upward and Downward Selection in
Intelligence Testing’ (1948), Brit. J. Psychol., XXXIX, 1.
3 Some of these are described in an article by Heim, A. W., ‘Recent
[...] in Intelligence Testing’ (1948), Quart. Bull. Brit. Psychol.
Soc., 1, 2.

Finally, the simple numerical expression of an individual’s 
intelligence as somewhere between 50 and 150 invites the 
unscrupulous and the ignorant to take these figures at their 
face value—disregarding the fact that they represent ratios 
—and to infer that the subject with an I.Q. of 135 is half as
intelligent again as the one with an I.Q. of 90. People rarely
express themselves as crudely as this but many of the 
pronouncements of the test users if followed up would lead 
to such absurdities—which are none the less dangerous for 
being meaningless. 

I should suggest that intelligence test scores be expressed 
in terms of percentile status, giving always details about the 
particular test used and the size and the constitution of the 
group on which the norms were established. This would 
retain the advantage of simplicity and ready comprehension 
by the non-statistically-minded. The inclusion of data 
concerning the test and the group would serve to underline 
the limits of applicability and would imply no assumption 
as to constancy, innateness or translatability. 
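A minimal sketch of such a report is given below. The class `Norms`, its field names, the test name and the twenty norm scores are all hypothetical; ‘percentile status’ is taken to mean the percentage of the norm group scoring below the subject.

```python
from dataclasses import dataclass


@dataclass
class Norms:
    test_name: str          # which test the norms belong to
    group_description: str  # constitution of the norm group
    scores: list            # crude scores of the norm group

    def percentile_status(self, score):
        """Percentage of the norm group scoring below `score`."""
        below = sum(1 for s in self.scores if s < score)
        return 100 * below / len(self.scores)

    def report(self, score):
        """Percentile together with the test and group it refers to."""
        return (
            f"{self.percentile_status(score):.0f}th percentile on "
            f"{self.test_name} (norms: {self.group_description}, "
            f"n={len(self.scores)})"
        )


# Invented norm group of twenty subjects with crude scores 21 to 40.
norms = Norms("Hypothetical Test X", "first-year students, one college",
              list(range(21, 41)))
print(norms.report(33))
```

Because the test and the group travel with the figure, the reader is reminded at every use that the result holds for that test and that population only.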

This method of expressing test results has been advocated 
by many, and adopted by some, workers in the field. There 
are some, however, who believe that this method produces 
more difficulties than advantages. The alleged objections to 
using percentiles include the following: (i) that their use 
makes the combination of test scores difficult; (ii) that many 
of the relevant statistical techniques cannot be applied to 
percentiles; (iii) that the layman does not readily understand 
percentiles.

These three statements are true but it is not obvious that
they constitute objections to the use of percentiles. A combined
score has less meaning than has a series of single scores:
it presents a less vivid picture of the subject and, however 
combined, a less faithful rendering of the results of the tests. 
The argument about statistical techniques provides a good 
example of the way these have become an end rather than 
a means in mental testing. Instead of saying ‘Let us decide 
on what is psychologically the most satisfactory way of 
expressing test scores and let us then employ, or if necessary 
devise, suitable statistical methods of manipulation’, the 
psychometrist tends to employ the existing methods, grate- 
fully accepting any refinements which the statistician may 
offer, and excluding forms of expression which do not lend 
themselves to these methods. 

Lastly, that the layman may not understand the meaning 
of percentiles seems to me as much an asset as a drawback. 
The facility with which he understands, or at least uses, 
I.Q.’s in practice has not made for unmitigated good. If
scores were expressed in terms of percentile status, the lay- 
man would require some explanation, in the course of which 
he might come to recognize some of the problems of mental 
testing, and thereafter he might use test results more 
cautiously and flexibly. 

There are, however, certain more positive advantages to 
be gained from expressing test norms in terms of percentiles. 
The practice makes clear how relative are all the data—the 
norms, the consistency, the validation. The user can decide 
where to draw the line between the percentiles, whether to 
give ‘grades’ and, if so, what percentages to include for 
example in his extreme and his middle grades. He can see 
from the frequency distribution of the scores what divisions 
would be justified and what would constitute laying undue 
weight on small differences. 

Percentiles have universal application, in the sense that
‘60th percentile’ always means ‘better than 60 per cent of
the subjects on that test’. They are not, therefore, open to
the objections of giving a test result simply in terms of crude 
score; this has, of course, a widely different significance for 
different tests. This latter practice would be of little use but 
it would not mislead since no one would labour under the 
impression that a score of 58 on Progressive Matrices 1938 
for instance is equivalent to 58 on the Wechsler-Bellevue or 
the Cattell III tests. Moreover, the employment of percentiles
demands that details of the group be included: 60th
percentile on what population? There is no tacit assumption
that the norms have been established on a random sample of
adults.

To sum up: chronological age is a valuable criterion 
against which to standardize and validate intelligence tests 
for children.1 Unlike most criteria, it is objective,
self-consistent and entirely independent of the tests. Mental age
and I.Q., both of which are largely determined by chrono- 
logical age, have their uses for children provided that the 
testers and interpreters bear in mind the pitfalls. Neither 
chronological age nor mental age can be usefully employed 
when assessing the intelligence of adults. It is suggested that
both adult and child intelligence test norms be expressed in 
terms of percentile status since this method of presentation 
is the least misleading and the most informative. 


1 For further discussion of chronological age as a criterion, see Chap. 
VIII, pp. 106-9. 


Chapter Five 


THE APPROACH OF THE FACTOR ANALYSTS 


This chapter consists mainly of destructive criticism. What 
constructive suggestions are made in this book are largely 
confined to the last three Chapters. It may be objected that 
many of my criticisms apply to particular factorial practi- 
tioners rather than to factor analysis itself; it may be further 
objected that criticisms of a technique come best from those 
who are fully cognizant of its character and experienced in 
its use and that neither of these qualifications applies to me. 
Both these objections are warranted. 

What justification have I then for producing such a 
chapter? It is threefold. (i) Although many of the practices 
I decry are attributable to individual exponents rather than 
to the theory of factor analysis as such, there is a great, and 
growing, number of such individuals. They are more 
prolific than the few wise and cautious factor analysts— 
whose claims are modest and whose conclusions are of 
value. (ii) My criticisms are primarily psychological and 
philosophical. I have strayed scarcely at all into the realm
of statistical criticism: other writers have already done so, 
far more adequately than I could. (iii) I have added an 
appendix to this chapter in which the principles and the 
rationale of factor analysis are outlined, lucidly and 
sympathetically, by P. E. Vernon. I should advise anyone 
who is unacquainted with the subject to read this appendix 
before reading my chapter. 


To those of the factorial persuasion it must appear strange 
to devote to it one isolated chapter in a book on the appraisal 
of intelligence and to refer to it scarcely at all before or after
that chapter. For the factor analyst holds, on the keys of his
calculating machine, the solutions to the problems which
puzzle the unconverted: What do you mean by ‘intelligence’
(or ‘memory’ or ‘perseveration’)? How do you know what
you are testing? How do you validate your tests?
The answer is simple: ‘by means of factor analysis’—
though the matter is rarely stated thus simply and
uncontroversially. The specific techniques of factor analysis are,
of course, far from simple and very far from uncontroversial.
But a majority of the factorists appear agreed that a clear-
cut key to these problems exists and is in their hands: they
differ only as to whether the key be a Yale or an Ingersoll,
a mere combination number or a tumbler lock requiring
much rotation.1 As Loevinger states, ‘By far the most
widely accepted solution to the problem [of intelligence] is 
that of the factorists,’2 
The factor analysts use their findings for purposes of both
validation3 and definition. In a given battery of tests, the
more highly saturated a test with the general (or first) factor,
the more validly does it test g. What is g? It is that which
the best intelligence tests, on this criterion, measure or—
which comes to the same thing—that which causes nearly all
mental tests to intercorrelate as they do. The technique was
originally, and still could be, a useful mathematical device.
Used prim[...] [...]sproving hypotheses [...]
[...]gesting new principles of classification, it could be [...]
[...]ted above, however, [...]
have often tended to be opponents who combine distrust and
distaste with their lack of understanding of the technique.

1 For a survey of factor analytical literature, see ‘Factor Analysis to [...]
Psychometric Monographs No. 3, Univ. of [...]

2 See page 595, in Helson’s Theoretical Foundations of Psychology (1951).

3 E.g., page 4 of Conference on Factorial Studies of Aptitude and Personality
Measures (1952), Educational Testing Service.

Factor analysis consists of computation and manipulation 
of matrices of correlation coefficients. In psychology, these
are usually obtained by correlating mental test scores one 
with another. The intercorrelations may, however, be 
between any measures—tests, ratings, examinations or even 
subjects, or some combination of these. The first matrix is 
calculated, that is, each measure is correlated with every
other measure in turn. Let us assume, in this instance, that 
the number of tests and subjects are adequate, that there are 
no gross errors of sampling, that the regressions are linear 
and that the test distributions are normal or have been 
transformed to normality. The matrix is then worked on 
according to whichever system the particular factor analyst 
favours, to determine, for instance, whether there is a factor 
common to all or most of the measures used, how far each 
of the measures possesses it—and similarly for the lesser 
factors, known as second and third, etc. The factor which 
accounts for the greatest part of the variance (which is 
generally the one for which most tests have positive loadings) 
is known as the first factor. 
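The bare procedure can be sketched in a few lines. Two cautions apply: the sketch extracts the first principal component of the correlation matrix by power iteration, which is a stand-in chosen for brevity and not the method of any particular school of factorists; and the three sets of test scores are invented. The function names are likewise hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_matrix(measures):
    """Correlate each measure with every other measure in turn."""
    return [[pearson(a, b) for b in measures] for a in measures]


def first_component(matrix, iterations=200):
    """Dominant eigenvalue and loadings of a correlation matrix."""
    n = len(matrix)
    v = [1.0 / sqrt(n)] * n
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(
        v[i] * sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)
    )
    loadings = [sqrt(eigenvalue) * x for x in v]
    return eigenvalue, loadings


# Invented scores of six subjects on three tests.
test_a = [12, 15, 9, 20, 17, 11]
test_b = [30, 34, 25, 41, 35, 29]
test_c = [5, 7, 6, 9, 6, 4]

r = correlation_matrix([test_a, test_b, test_c])
eigenvalue, loadings = first_component(r)
share_of_variance = eigenvalue / len(r)
```

With these figures every test has a positive loading and the first component accounts for the greater part of the variance. The result describes this battery and these six subjects and nothing more.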

If this were all, we could experience only gratitude for a 
method which enables us to express in quantitative terms 
the conclusions which we are tempted to draw from inspec- 
tion of the correlation matrix. But the factorist goes further. 
Rarely is he content to express his findings in terms of ‘a 
first factor using the X method, on such and such a test 
battery, with a group of n subjects selected in such and such 
a way....’ Statements of this kind would be innocuous and
perhaps informative, even if axes had been rotated
indefinitely to get new factors. Many factorists, however, like
to substitute for this cumbersome phrase the brief, less
accurate, term ‘general intelligence’ or, still more briefly, g;
and therein lie my major criticisms. As Loevinger puts it:
‘All this computational labour—and it is usually a lot—to
obtain intuitive names for hypothetical factors of mind.’
How do you know, with a given battery of tests, that the 
first factor (or the most important factor after rotation) is 
‘intelligence’? We shall discuss later the more fundamental 
question of whether the attempt to split the mind into a 
number of discrete modern factors or ancient faculties is 
profitable. Here we are concerned with the connotation of 
the first factor, provisionally accepting the general notion of 
separate isolatable mental attributes. If indeed the factor 
does denote anything with psychological reality, what this 
is will depend entirely on the particular battery of tests, the 
particular group of subjects and the particular statistical 
technique favoured. The second of these points is often for- 
gotten: the critics of factor analysis (and sometimes the 
factorists themselves) stress the dependence of the findings 
on the tests used, and of the conclusions on the statistical 
methods and psychological predilections of the factorist. But 
they often ignore the fact that it is the interaction between 
tests and subjects which provides them with their data. 
Again, psychometrists tend to speak (both pre- and
post-factor analysis) of Test A as a ‘speed test’ and Test B as a
‘power test’; of the ‘reliability of Test C’ and the ‘validity of 
Test D’—in vacuo, as though these terms divorced from the 
groups which produced the scores had some immutable 
meaning. Yet no-one is better placed than the factorist to 
observe the changing shape of the distribution curve, the 
alteration in variance and the difference in test-retest 
consistency when the same test is applied to different groups. 
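How far test-retest consistency belongs to the group rather than to the test can be seen from a constructed case. Everything in the sketch below is invented for illustration: twenty subjects score 1 to 20 on first testing, and each retest score differs from the first by an alternating 3 points up or down.

```python
from math import sqrt


def pearson(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


first = list(range(1, 21))
retest = [s + (3 if i % 2 == 0 else -3) for i, s in enumerate(first)]

# Consistency in the whole, unselected group.
r_whole_group = pearson(first, retest)

# Consistency among the five highest scorers on first testing only.
top = [(a, b) for a, b in zip(first, retest) if a >= 16]
r_selected = pearson([a for a, _ in top], [b for _, b in top])
```

The same test, with the same size of fluctuation on retest, yields a coefficient of about 0.88 in the whole group and about 0.43 in the selected group. To speak of ‘the reliability’ of the test without naming the group is therefore to say very little.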
It may be said that once the notion of factors is accepted 
the question ‘how do you know the first factor is “‘intelli- 
gence”? is academic. We are variously told: (i) It does not 


1 Loevinger, Jane. Chapter on intelligence in Theoretical Foundations of 
Psychology, Helson, Harry (1951). This chapter contains a good discus- 
sion on the ‘objectivity’ of factor analysis as used in psychology. 
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matter what name you give it; call it g or x or intelligence or 
brighiness. What is denoted is the factor which the test 
battery is shown to measure. No claim is made as to the 
existence of this (or any other) factor as a psychological 
entity. (ii) It is self-evident that a factor possessed in 
common by a lot of different tests must be a general factor, 
that is, general mental ability or intelligence, and the finding
of it is evidence of its existence. 

Examination of factor analytical work reveals the unsatisfactoriness
of these attitudes. Attitude (i) would be unexceptionable
(though of little relevance to psychological theory) if the 
factorists were consistent in it. But in practice it is virtually 
impossible to maintain this attitude permanently. Even if 
this agnostic creed be stated in Chapter I and maintained
in Chapter II, by Chapter III the first factor will be tacitly 
equated with ‘intelligence’ and ‘intelligence’ will be found to 
have most of the associations which it has for the layman. 

The second attitude is expressed far less frequently than 
the first, but it is often implicit in factorial ‘interpretations’. 
It is not defensible since the finding in question is a statistical 
artifact: all tests in a battery will tend to be positively
correlated whilst it is impossible for all the tests to have 
negative intercorrelations. By a kind of dissociative process 
both attitudes are sometimes found, at different times, in 
one and the same writer. 
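
The sense in which a battery cannot consist wholly of negatively correlated tests can be made precise with a short calculation. The following is only a sketch, assuming n tests whose scores have been standardized to unit variance (written z_1, ..., z_n here, a notation not used in the text) and writing \bar{r} for the mean of their n(n-1) off-diagonal intercorrelations:

```latex
\[
0 \;\le\; \operatorname{Var}\!\Big(\sum_{i=1}^{n} z_i\Big)
  \;=\; \sum_{i=1}^{n}\operatorname{Var}(z_i) + \sum_{i \ne j} r_{ij}
  \;=\; n + n(n-1)\,\bar{r}
\qquad\Longrightarrow\qquad
\bar{r} \;\ge\; -\frac{1}{n-1}.
\]
```

For a battery of six tests the mean intercorrelation therefore cannot fall below -0.2, whereas nothing prevents it from approaching +1. The bound tightens towards zero as tests are added, so that a large battery is arithmetically barred from showing substantial negative intercorrelation throughout.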

The naming of tests and of factors is far from being of 
merely academic interest, in view of the practice among 
factorists of taking these names at their face value. This 
naming is fairly uniform among factorists who use the same 
method; a certain amount of natural criticism is made, 
however, of members of one school by members of another, 
regarding the names of specific tests and factors. This is apt 
to provide interminable entertainment since the contestants 
do not question one another’s fundamental premises and
argue only about conclusions—which rarely can be 
confirmed or refuted at that level. The same difficulties apply
to a greater degree to the ‘identification’ of specific aptitudes;
and greatest of all are those encountered in the field of 
personality when the meaning of the mental traits and their 
intrinsic reliability are exceedingly questionable. 

An illustration may serve to clarify some of the points that 
have been raised. Let us take a controversy over g versus k 
which arose as a result of a piece of factor analysis and which 
the contestants attempted to resolve by means of factor 
analysis. I select this particular example from a very large 
choice since the main point at issue is a practical and a 
psychological one, since the example concerns both general 
intelligence and a specific factor and since it illustrates the 
inevitable reification of factors at some stage—not to be 
confused with the stage at which they are named. 

In 1943, Slater¹ published the results of an inquiry he had
conducted on the question as to the age at which specific 
abilities develop in children. He was interested in spatial 
perception, or k, tests as one of the most important means of 
discriminating between the more scholastic or theoretically- 
minded school-children on the one hand and the more 
practical or technically-minded children on the other. As a 
result of his tests and his factor analysis of the results, he 
reached the conclusion that there is little or no difference 
on the specific ability k at 11 or even at 13 years of age; 
and, on the basis of this conclusion, he advocated that the 
divisions between types of school be less rigid and that
children be allocated to them at a later age than at
present.


Some years later Adcock² published a reply, in the same
journal, in which he described how he had re-analysed the
data and arrived at the opposite conclusion from that of
Slater. The following excerpt is drawn from Adcock’s concluding
paragraph: ‘We have now analysed our data by
three methods in addition to the cluster analysis and there
seems to be no doubt of the presence of k in the tests selected.
There is no reason to suspect that the inclusion of all the
tests would give different results, and the data for the 11-year
children appears to be so similar that we could expect the
same results from it. . . . In view of the educational implications
of this research we feel that this re-interpretation should
be widely known.’

1 Slater, P., ‘The Development of Spatial Judgment and its Relation
to some Educational Problems’ (1943), Occup. Psychol., xvii, 3.
2 Adcock, C., ‘A Re-Analysis of Slater’s Spatial Judgment Research’
(1948), Occup. Psychol., xxii.

Thus both writers are acutely aware of the importance of 
the problem from an educational point of view. Yet both 
evidently believe that the solution lies in analysis and re- 
analysis of a given collection of data, Adcock even going so 
far as to infer the information about 11-year-olds from his 
controversial findings about 13-year-olds. He is no doubt 
aware that psychologists with sufficient statistical myopia 
may consider that they have ‘reason to suspect that’ the 
11-year-olds might produce results which differ from those
of the 13-year-olds. 

Some months later Slater published his brief reply.¹ He
was quite unshaken by Adcock’s proof of ‘undoubted 
presence of k in the tests selected’. Slater explained his lack 
of conviction on both statistical and psychological grounds. 
He stated that the finding of a three-factor solution (as 
Adcock had done) ‘does not . . . diminish the significance of
the finding that a two-factor solution was adequate’: the 
original experiment had been designed so that if g, v and k 
were all functioning, a two-factor solution could not have 
been fitted. Moreover he considered that Adcock had found 
himself ‘compelled to introduce an ad hoc definition of k 
which seemed [to Slater] psychologically unacceptable’. 
This last criticism (on a point which characteristically had 
not previously been mentioned by either side) is always open 

1 ‘Mr. Slater replies to Dr. Adcock’ (1949), Occup. Psychol., xxi, 2.
to any factorist who disagrees with the findings of a fellow
factorist.

I am not concerned here to assess the merits of the
statistical methods which have been or which might be used
on these data, even were I competent to do so. It is clear
that conclusion x or non-x can be extracted given the data,
the technique, the inclination and the time. I wish to illustrate
the absurdity of thinking to gain new information by
pitting one method of factor analysis against another and
the necessity, in such an instance, for further research. No
series of factor analyses, however refined, can take the place
of (i) defining the question and (ii) conducting a series of
experiments to answer the question. When the latter has
been performed a factor analysis of their results may or may
not prove fruitful to verify specific, and previously stated,
hypotheses.

The most desirable plan in this controversy would have
been to allocate experimental groups of children aged eleven, 
twelve, thirteen and fourteen to different schools and to 
follow up their progress. If, however, the contestants considered
such an undertaking as outside their sphere, it would
still have been possible, in the interval between the first two
papers quoted above, to conduct a field experiment which
would at worst have disproved the contrary of the original 
hypotheses, at best have confirmed the hypotheses. The point
at issue (somewhat oversimplified) is this: ‘Have adults a
specific aptitude k which young [...] what sort of age do
children begin [...]
is, the g and k scores would intercorrelate as highly or nearly
as highly as their g scores with one another and their k 
scores with one another. 

It is, in fact, not difficult to find tests agreed by most 
psychologists to assess intelligence, and other tests agreed to 
assess spatial perception, provided the judges are not 
immersed in some relevant dispute at the time. It would 
therefore be possible to compound the necessary battery and 
give it, suitably balanced, to groups of 10-year-olds, 11-year- 
olds... 18-year-olds and also to a group of 19- to 25-year- 
olds. 

This brief outline of one of the possible experiments oversimplifies
the problem a good deal. It fails, for instance, to
take into account differences in general mental level—with
consequent differences in range and distribution of the scores 
of the various groups. It might be found necessary to test 
several groups of adults, of varying mental levels (assessed 
perhaps by occupation or by education) before a meaningful 
comparison could be made. It might well be found that all 
the correlations of the younger children were low simply 
because the tests were too difficult and they therefore yielded 
very small variances, in the same way that a highly selected 
group does on most tests. My aim is merely to demonstrate 
some of the limitations of factor analysis as a means of gain- 
ing new psychological information and the liability of 
factorists to limit their search to this narrow path. 
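
The shrinkage of a correlation in a highly selected group can be illustrated with a small simulation. The sketch below is purely illustrative: the population correlation of 0.6, the cut-off of one standard deviation above the mean, the sample size and the function names are all assumptions chosen for the demonstration, not figures drawn from any study discussed here.

```python
import math
import random


def pearson(xs, ys):
    """Product-moment correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate(n=20000, rho=0.6, cutoff=1.0, seed=0):
    """Return (correlation in the full group, correlation in a selected group).

    Two hypothetical test scores are drawn with population correlation rho.
    The 'selected' group keeps only subjects whose first score exceeds the
    cutoff, which narrows the range of scores and hence the variance.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        y = rho * x + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        pairs.append((x, y))
    full = pearson(*zip(*pairs))
    selected = [(x, y) for x, y in pairs if x > cutoff]
    restricted = pearson(*zip(*selected))
    return full, restricted


if __name__ == "__main__":
    full, restricted = simulate()
    print(f"full group r = {full:.2f}; selected group r = {restricted:.2f}")
```

The same two tests thus appear to be strongly related in one group and only weakly related in another, although nothing about the tests themselves has changed. A test that is too difficult for young children compresses their scores in a comparable way.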

The unconverted sometimes put their criticisms in the 
form: ‘You can only get out of a factor analysis what you 
put in.’ Whilst this is of course true, in a sense, it is true also 
to say that you can, if highly skilled in the craft, get out of it 
whatever you choose. You avoid time-consuming and ‘sub- 
jective’ considerations such as the technique of testing, 
differential interest among subjects, discussion of what 
introspection and observation suggest that the tests are 
assessing, and you concentrate on ‘objective’ and exact 
methods which can in the final stages be induced to yield
refined results.
Another example may serve to substantiate this point. 
Thurstone in the preface to his Primary Mental Abilities¹ writes
as follows: ‘When this study was planned, we postulated a 
number of tentative psychological categories or factors 
which served merely to insure that a wide variety of tests of the 
paper-pencil sort were included. The primary factors that 
appeared have a general relation to the tentative categories 
with which we started, but they are not identical with the 
tentative categories. We had postulated a verbal factor, 
but we found two distinct verbal factors in the analysis. We 
found that the number factor is highly restricted. We had 
postulated different reasoning factors for verbal, numerical 
and spatial material; but this tentative classification was not 
sustained. The reasoning tests revealed two factors that we
have called “induction” and “deduction”, the latter being 
less clearly indicated than the inductive factor. These 
reasoning factors seem to transcend the immediate character 
of the material of the tests. We had separate tentative cate- 
gories for visualizing in flat space and in solid space, but our 
analysis did not reveal such a division. These tests collapsed 
into a single visual space factor. From the methodical stand- 
point these findings give strength to the factorial methods in that they 
do not merely reproduce the classifications that we had in mind. The 
factorial methods have so far indicated their effectiveness in 
testing psychological hypotheses. It is in this function that 
the factorial methods will justify themselves in experimental 
and theoretical psychology.’ (My italics throughout.) 
Thus, if the findings of a factor analysis are in accordance 
with preliminary ‘tentative psychological’ hypotheses, they 
may be taken to confirm the hypotheses—even if the latter 
were formulated ‘merely to insure that a wide variety of tests 


1 Thurstone, L. L., ‘Primary Mental Abilities’ (1943), Psychometric
Monograph No. 1, Univ. of Chicago Press.
were included’. If, however, the ‘tentative classification
is not sustained’, then it may be inferred that ‘factorial 
methods have . . . indicated their effectiveness in testing 
psychological hypotheses’: that the factors yielded do not 
fall in line with the particular tests used, is accepted as proof 
that the ‘factors’ have an important and independent 
psychological significance. 

As in most work of this kind, the problem of whether each 
test assesses anything apart from the ability to do it and, if 
so, what this is, does not receive a fraction of the space or the 
thought devoted to statistical argument. ‘It may even 
happen,’ as Thurstone asserts later in the preface, ‘that such 
abilities as Number, Induction, and Memory may be 
appraised by tachistoscopically presented discriminatory 
tasks that do not contain any numbers, that do not call for 
memorizing in the usual sense, and which do not involve 
inductive or deductive thinking in explicit verbalized form’! 
This is on a par with Eysenck’s hysterics who are no more 
suggestible than non-hysterics and his unsociable subjects
who are not introverted.¹ ‘Wrongly are they called swine’
would appear to be the closest classical analogue. 

One more generalization may be based on the excerpt 
from Thurstone’s preface. From the fact that the educed 
factors differed from some of his pre-analysis categories, he 
concluded that you can get more from a factor analysis than 
you put in: ‘These findings give strength to the factorial 
method in that they do not merely reproduce the classifica- 
tion that we had in mind.’ But the difference between the 
pre- and post-analysis categories is relatively trivial, especially
in view of the carefree way in which the factors found are 
identified. The important points were settled, in accordance 
with unstated assumptions, as soon as the investigation along 
factor analytical lines was planned. The general form of the 
results was predetermined. The specific findings hold little 

1 Eysenck, H. J., Dimensions of Personality (1947), Chaps. 1 and v. 
interest for the non-factorist since he knows in advance
what form they must take and he does not accept the 
universality of the relations found between the particular 
test battery and the particular group—nor of the use to 
which the particular experimenter decided to put certain 
words. This is the very important sense in which the
experimenting factorist only gets out what he puts in, whether
or not his post-analysis factors tally with his pre-analysis 
categories. 

There is one further criticism of factor analysis, as applied 
to psychological problems, to which attention has been 
drawn by Babington Smith.¹ This concerns the order in
which the subjects take the tests which yield the scores for 
the factor analysis. It is well known inside and outside the 
psychological laboratory, that the order in which people 
tackle tasks affects their performance on these tasks. This 
has been demonstrated, as between question and question, 
with reference to gradient of difficulty in tests, and my 
colleagues and I have shown the effects on the same test 
taken at regular intervals.² All work demonstrating transfer
of training effects, positive or negative, provides examples 
of the importance of order. 

In almost all factorial investigations, however, the role of 
order is ignored. It is usual for group tests to be used and for 
the group as a whole to take Test A, followed by Test B, 
followed by Test C, etc. It would be interesting to conduct 
an experiment in which the subjects were split into sub- 
groups, each of which took the tests in a different order. 
Such an experiment might well produce different factors 
from an experiment performed by the same investigator, 
with a similar group and an identical test battery—save that 
the tests were given in uniform order to all subjects. Once 


1 Babington Smith, B., ‘An Evaluation of Factor Analysis from the
Point of View of a Psychologist’ (1950), J. R. statist. Soc., Series B, xii, 1.
2 Cane, V. R. and Heim, A. W., ‘The Effects of Repeated Retesting:
III’ (1950), Quart. J. exp. Psychol., 1, 4.
again the findings would depend on the particular ingredients
used: with a differentially test-sophisticated group
they might be quite startling. 
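
One way in which such sub-groups might be given different orders is by a cyclic Latin square, in which every test occupies every serial position once. The sketch below is merely one possible arrangement under that assumption; the function names and the test labels are hypothetical.

```python
def cyclic_latin_square(tests):
    """Return one presentation order per sub-group.

    Row k is the list of tests rotated by k places, so that each test
    appears exactly once in each serial position across the sub-groups.
    """
    n = len(tests)
    return [[tests[(row + col) % n] for col in range(n)] for row in range(n)]


def assign_subgroups(subjects, tests):
    """Allot subjects to the orders in rotation (a hypothetical scheme)."""
    orders = cyclic_latin_square(tests)
    return {
        subject: orders[index % len(orders)]
        for index, subject in enumerate(subjects)
    }


if __name__ == "__main__":
    for order in cyclic_latin_square(["Test A", "Test B", "Test C"]):
        print(" -> ".join(order))
```

A cyclic square balances serial position only. Test B still always follows Test A wherever both appear consecutively, so an investigator concerned with transfer from one particular test to the next would need a more elaborate design.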

My remaining criticisms will be listed briefly since they 
concern misuse of the method of factor analysis. Inadequate 
attention to testing technique, to group sampling and to 
distributions of scores is not infrequently found. These 
defects may of course be found also in the work of psychometrists
who are not factorists but they are perhaps less prone
to ignore these matters since they are often interested in such 
ancillary problems for their own sake and they do not believe 
that objective methods, large groups and a high degree of 
mathematical precision can compensate for neglect of such 
problems.

Under ‘testing technique’ I include the stimulation and 
maintenance of motivation, before and after the tests, as 
well as their actual administration. It may be argued that 
these points are not worth bothering about provided the 
group is treated uniformly, but objective uniformity rarely 
achieves uniform psychological results. Errors of sampling 
can be allowed for as can errors in the test, such as low con- 
sistency: in a sense the greater the ‘attenuation’, the more 
satisfactory the prognosis from the point of view of validity
(see Chapter VII). Skewed frequency distributions can be 
transformed into normal ones, by one method or another. 
But by the time the scores gained from subjects, some of 
them bored, some nervous, some familiar with tests, some 
eagerly interested, have been transformed to normality, 
intercorrelated, the correlations corrected for attenuation 
and the factors axis-rotated, it is difficult to know just 
what psychological significance to attach to the interpre- 
tations. 

I believe that factor analysis is a potentially useful tool in 

1 For a comprehensive and witty discussion of this topic, see: McNemar,
Q., ‘The Factors in Factoring Behaviour’ (1951), Psychometrika, XVI, 4. 
psychology; it has already given evidence of its possibilities
in medicine, in agriculture and in economics. As an aid to 
solving problems of classification and as a means of testing 
certain specific hypotheses it is uniquely valuable. However, 
I believe also that the deflection of by far the greater part of 
the work on mental testing into exclusively factorial channels 
and the treatment of factor analysis as an end in itself, is
sterile, if it has not actively interfered with progress in 
psychology. 


From Chapter I of P. E. Vernon’s book The Structure of 
Human Abilities. 


MENTAL FACULTIES AND FACTORS 


The faculties or powers of the human mind have been for centuries 
a matter of interest, not only to the ordinary man who wishes to 
explain his own conduct and that of other people, but also to the 
philosopher, psychologist and educationist. Until recent years, 
however, their nature and numbers were matters of pure specula- 
tion. Casual observation and introspection are incapable of 
providing scientific proof of their existence, and in consequence 
many past theories of human abilities and qualities and their 
organization were entirely fallacious. . . .

Psychologists nowadays tend to adopt a more operational or 
Behaviouristic outlook, though rejecting the wilder excesses of 
J. B. Watson’s doctrines. They realize the fruitlessness of mental
entities such as faculties, which can never be directly observed nor 
verified, and prefer to deal with concepts directly derived from 
measurable activities of human beings. An ability is inferred from
the fact that some people carry out certain tasks more rapidly or
more correctly than others. By
means of correlation we can find whether the scores of a group
of people on two or more tasks correspond or not, and therefore 
whether these tasks involve the same, or distinctive, abilities. If 
several tests presumed to measure a particular ability do not 
correlate positively with one another, that ability cannot be 
accepted as a useful conception. Take memory as an example. 
We all know that a schoolboy may have an excellent memory for 
cricket scores or names of motor-cars, and a poor memory for 
school work, and that a professor who remembers everything about 
his own subject may be absent-minded in daily life, or forgetful of 
names and faces. If these various kinds of memory are measured 
and intercorrelated and little or no agreement is found, it is 
obvious that there is no one general faculty of memory, but a lot 
of specific varieties. We need not demand that such tests correlate 
perfectly; they may show a limited amount of overlapping, and 
some may correlate more highly with the rest than others do. But 
only in so far as they do correlate can they be regarded as measur- 
ing a memory ability or factor. Otherwise each test is merely 
measuring the ability specific to that test and to no other. It 
follows too that any test can be regarded as divisible into two 
portions which we call its communality and its specificity, i.e. what
it has in common with other tests and what is specific to it
alone.

There is yet another possibility. Positive correlations between 
several tests designed to measure memory might arise if the
tests were in fact all measuring some other, more fundamental, 
ability, say intelligence. Factorial technique enables us to 
examine this, and to discover whether or not there is overlapping 
over and above anything attributable to intelligence. We thus 
arrive at the definition of an ability given by the writer else- 
where:! ‘It implies the existence of a group or category of per- 
formances which correlate highly with one another, and which 
are relatively distinct from (i.e. give low correlations with) other 
performances.’ 

It is unfortunate that this approach to the analysis of abilities
involves somewhat complicated mathematics, since this frightens 


1 Vernon, P. E., The Measurement of Human Abilities (1940), Univ. of 
London Press. 
or antagonizes many of the teachers, employers, and others
who are most prone to discuss abilities unscientifically. Yet the 
basic principles are very simple, as the following hypothetical 
examples will show. 


TABLE I 


CORRELATION COEFFICIENTS BETWEEN SIX 
PSYCHOLOGICAL TESTS 


Tests                  1     2     3     4     5     6
1. Vocabulary               .76   .79   .45   .41   .34
2. Analogies          .76         .68   .44   .35   .26
3. Classifications    .79   .68         .49   .39   .32
4. Block Design       .45   .44   .49         .58   .44
5. Spatial            .41   .35   .39   .58         .55
6. Formboard          .34   .26   .32   .44   .55


Table I gives the correlations that might be obtained between 
six tests applied to a large group of children (Block Design and 
Formboard being given individually). Inspection suggests that the 
correlations between the first three and last three are relatively 
small, i.e. that ability at verbal tests is partially distinct from 
ability at practical or spatial tests. But the separation is incom- 
plete. All the correlations are positive, showing that all tests 
have something in common, presumably of the nature of general 
intelligence. By the appropriate techniques we can find how far 
each test measures this general ability or factor which we shall 
call g and Table II lists the loadings, saturations or correlations
with g. Now if this was the only underlying ability, we could
reproduce the test intercorrelations simply by taking the products
of their g-loadings. For example:

\[ r_{15} = r_{1g} \times r_{5g} = .8 \times .5 = .40 \]
Such products are listed in Table II, and in Table III each
product has been subtracted from the corresponding original 
correlation to show what overlapping, if any, remains. These 
are known as residual correlations. 


TABLE II 


G-LOADINGS OF THE SIX TESTS AND THEIR PRODUCTS


                                             Products
Tests               G-Loadings    1     2     3     4     5     6
1. Vocabulary           .8             .56   .64   .48   .40   .32
2. Analogies            .7       .56         .56   .42   .35   .28
3. Classifications      .8       .64   .56         .48   .40   .32
4. Block Design         .6       .48   .42   .48         .30   .24
5. Spatial              .5       .40   .35   .40   .30         .20
6. Formboard            .4       .32   .28   .32   .24   .20


TABLE III 


RESIDUAL CORRELATIONS AFTER SUBTRACTING 
THE OVERLAPPING ATTRIBUTABLE TO G 


Tests                   1      2      3      4      5      6
1. Vocabulary                +.20   +.15   -.03   +.01   +.02
2. Analogies          +.20          +.12   +.02   +.00   -.02
3. Classifications    +.15   +.12          +.01   -.01   +.00
4. Block Design       -.03   +.02   +.01          +.28   +.20
5. Spatial            +.01   +.00   -.01   +.28          +.35
6. Formboard          +.02   -.02   +.00   +.20   +.35
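
The arithmetic behind Tables II and III can be retraced in a few lines. The sketch below is an illustration only: the variable and function names are hypothetical, the correlations are those of Table I, and the value of .55 for tests 5 and 6 is taken as the sum of the corresponding product (.20) and residual (.35).

```python
# g-loadings of tests 1 to 6, as listed in Table II.
G_LOADINGS = [0.8, 0.7, 0.8, 0.6, 0.5, 0.4]

# Observed correlations from Table I, keyed by (test_i, test_j) with i < j.
# The entry for (5, 6) is inferred from Tables II and III: .20 + .35.
OBSERVED = {
    (1, 2): 0.76, (1, 3): 0.79, (1, 4): 0.45, (1, 5): 0.41, (1, 6): 0.34,
    (2, 3): 0.68, (2, 4): 0.44, (2, 5): 0.35, (2, 6): 0.26,
    (3, 4): 0.49, (3, 5): 0.39, (3, 6): 0.32,
    (4, 5): 0.58, (4, 6): 0.44,
    (5, 6): 0.55,
}


def g_product(i, j):
    """Correlation expected between tests i and j if g were the only factor."""
    return G_LOADINGS[i - 1] * G_LOADINGS[j - 1]


def residuals(observed):
    """Observed correlation minus the g-product, rounded to two places."""
    return {pair: round(r - g_product(*pair), 2) for pair, r in observed.items()}


RESIDUALS = residuals(OBSERVED)

if __name__ == "__main__":
    for (i, j), value in sorted(RESIDUALS.items()):
        print(f"tests {i} and {j}: residual {value:+.2f}")
```

Residuals linking a verbal test (1 to 3) with a spatial test (4 to 6) come out at no more than .03 in absolute size, while those within either group range from .12 to .35.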
The residuals between the first three and last three tests are
not all zero, but are so close to it that they can reasonably be 
attributed to chance errors in the original correlations. Within 
each group of three however the residuals are large, showing that 
distinct verbal and practical-spatial abilities are present. Each 
set can be analysed separately, and if the following loadings
are multiplied out, they exactly reproduce the residual correlations:


                    Verbal-factor                    Spatial-factor
                       loading                          loading
1. Vocabulary            .5       4. Block Design         .4
2. Analogies             .4       5. Spatial              .7
3. Classifications       .3       6. Formboard            .5
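
That these group-factor loadings reproduce the residuals exactly can be checked directly. In the sketch below the names are hypothetical; the loadings are those given for v and k in Table IV, and the residuals are the within-group entries of Table III.

```python
# Group-factor loadings, keyed by test number (values as in Table IV).
VERBAL = {1: 0.5, 2: 0.4, 3: 0.3}
SPATIAL = {4: 0.4, 5: 0.7, 6: 0.5}

# Within-group residual correlations from Table III.
TABLE_III = {
    (1, 2): 0.20, (1, 3): 0.15, (2, 3): 0.12,
    (4, 5): 0.28, (4, 6): 0.20, (5, 6): 0.35,
}


def reproduced(loadings):
    """Products of every pair of loadings within one group factor."""
    tests = sorted(loadings)
    return {
        (i, j): round(loadings[i] * loadings[j], 2)
        for position, i in enumerate(tests)
        for j in tests[position + 1:]
    }


if __name__ == "__main__":
    products = {**reproduced(VERBAL), **reproduced(SPATIAL)}
    for pair in sorted(TABLE_III):
        print(pair, "product", products[pair], "residual", TABLE_III[pair])
```

Every product agrees with the corresponding residual to two decimal places, which is what is meant by saying that the loadings reproduce the residual correlations.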


Subsidiary abilities, over and above g, are called group factors 
since they run through a limited group of tests. It is preferable 
to name them by symbols, such as v for verbal, k for spatial,
rather than giving them ability names which may readily be 
misinterpreted. Similarly we use g to refer to the objectively 
established general factor, instead of the subjective and indefinable 
term intelligence.
The communality of any test, i.e. its total factor-content, is
shown by the squares of its factor loadings. Table IV lists these
loadings, their squares, the communality (h²), and what is left
over from 1.0, i.e. the specificities. Thus we can state that the
vocabulary test measures 64 per cent g, 25 per cent v, and the
remaining 11 per cent is specific. The Formboard is a much
poorer g test, only 16 per cent of what it measures being attributable
to the general factor, [...]
specific. Such figures are known as the variances of the factors,
and the average variance of each factor is given in the bottom
row. These figures represent [...]
in this battery of six tests.
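
The figures of Table IV follow from the loadings by squaring and summing. The sketch below shows the calculation under the assumption that each test's variance is standardized to 1.0; the function and variable names are hypothetical.

```python
# Factor loadings of the six tests on g, v and k (as in Table IV).
LOADINGS = {
    "Vocabulary":      {"g": 0.8, "v": 0.5},
    "Analogies":       {"g": 0.7, "v": 0.4},
    "Classifications": {"g": 0.8, "v": 0.3},
    "Block Design":    {"g": 0.6, "k": 0.4},
    "Spatial":         {"g": 0.5, "k": 0.7},
    "Formboard":       {"g": 0.4, "k": 0.5},
}


def decompose(loadings):
    """Return (squared loadings, communality h2, specificity) for one test."""
    squares = {factor: round(a * a, 2) for factor, a in loadings.items()}
    h2 = round(sum(squares.values()), 2)
    return squares, h2, round(1 - h2, 2)


def percent_variance(factor):
    """Average squared loading on one factor over all tests, as a percentage."""
    total = sum(
        decompose(test)[0].get(factor, 0.0) for test in LOADINGS.values()
    )
    return round(100 * total / len(LOADINGS), 1)


if __name__ == "__main__":
    for name, loadings in LOADINGS.items():
        squares, h2, specificity = decompose(loadings)
        print(f"{name}: squares {squares}, h2 {h2}, specificity {specificity}")
    for factor in ("g", "v", "k"):
        print(factor, percent_variance(factor), "per cent of total variance")
```

For the vocabulary test this gives squares of .64 and .25, a communality of .89 and a specificity of .11, in agreement with the percentages quoted above.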
TABLE IV


COMPLETED FACTOR ANALYSIS OF SIX
PSYCHOLOGICAL TESTS


                        Loadings         Squares of Loadings
Tests                 g     v     k       g      v      k       h²   Specificity
1. Vocabulary        .8    .5            .64    .25            .89      .11
2. Analogies         .7    .4            .49    .16            .65      .35
3. Classifications   .8    .3            .64    .09            .73      .27
4. Block Design      .6          .4      .36           .16     .52      .48
5. Spatial           .5          .7      .25           .49     .74      .26
6. Formboard         .4          .5      .16           .25     .41      .59

% Variance                              42.3    8.3   15.0    65.7     34.3


From this example it should be clear that a factor is a construct
which accounts for the objectively determined correlations between
tests, in contrast to a faculty which is a hypothetical mental
power. We can if we wish go on to theorize about the psychological
nature and origin of factors. Better, we can conduct
experiments to discover just what performances involve a factor, 
among which groups of people it emerges, and what conditions 
affect it. But factors should be regarded primarily as categories 
for classifying mental or behavioural performances, rather than as 
entities in the mind or nervous system. Since by means of factor 
analysis we can reduce a large battery of tests to a few under- 
lying factors there is a certain parallel to the analysis of chemical 
compounds into their constituent elements. But this analogy 
should not be pressed too far, for we shall see later that factors 
are much too fluid, too dependent on the particular groups and 
particular tests studied, to be compared with elements. For example,
we might expect, and will indeed find, that the factors in
scholastic abilities are dependent on how school subjects are 
taught. Some teachers emphasize the connections between the
various branches of mathematics, or between a country’s language 
and its literature and history, much more than others do, and this 
is likely to be reflected in the correlations and factors. 

How factors should be identified and named is a somewhat 
controversial point. According to Guilford¹ the factorist studies
the common material, formal and functional features in the tests 
which are loaded with a factor and from this deduces its nature. 
Most factors are defined by material (e.g. verbal, mechanical 
information, etc.). The form of the test—whether apparatus or 
paper-and-pencil, choice-response or creative-response—has not 
yet been proven to have much influence. Functional factors 
involve consideration of the testee’s mental processes, by means 
of introspections or job analysis procedures or both (e.g. reasoning, 
attention, etc.). Bentley² and others have criticized the looseness
of factorists’ terminology, and the subjectivity of their guesses 
about the nature of some of their factors. We agree with him that 
it is better to avoid names of hypothetical functions or faculties, 
but would claim that the old-fashioned procedure (still common 
among some vocational psychologists, psychiatrists, teachers and
others) of assuming that a faculty exists and that certain tests
measure it, is very much more subjective. Factorists do not, in
fact, rely on hunches but always try to provide objective con- 
firmation of a factor by carrying out further analyses with other 
populations and with enlarged batteries of tests, with a view to 
defining its content and extent more accurately. 

The mistake should not be made of identifying the whole of 
the psychology of abilities with factor analysis. Vocational and 
educational selection and guidance must take account not only 
of personality traits and interests which might profitably be ex- 
pressed as factors also, but also of relevant experience, home 
circumstances and the like. And although there is a strong case 
for substituting objective tests for the subjective judgments of an 
interviewer, in practice it is seldom possible to carry out such 
guidance without an interviewer to bring together all the data 


1 Guilford, J. P., ‘Human Abilities’ (1940), Psychol. Rev., 47, 367-74.
2 Bentley, M., ‘Factors and Functions in Human Resources’ (1948),
Amer. J. Psychol., 61, 286-91.

and to interpret them to the candidates. Still more important
for the development of psychological science are experiments on 
conditions affecting the performance of skills and mental tasks, 
for example, investigations of the design of equipment, or studies of 
the learning process, of concept formation, of physical or mental 
fatigue and boredom, and so forth. Here factor analysis is largely 
irrelevant, since it deals only with the end products of human 
thinking and behaviour, and throws little light on how these 
products come about in individual human beings. Factors are
indeed a kind of blurred average, for though they derive from the 
common features displayed by a large group of people, they may 
stem from very diverse mental and physical processes in different 
people. Analysis does not even usually tell us which factors an
individual uses in any given performance, though it probably 
could do so. Thus one individual may score well at a test through 
high g, another might get the same score by virtue of some 
group factor, yet another through specific ability at that particular
test.

The real need for factors arises as soon as we begin to discuss 
and name abilities or traits, and to compare the relative standing 
of different people on such faculties. Factor analysis is comple- 
mentary, not opposed, to the approach of the experimental 
psychologist; but both are opposed to the layman’s unscientific
speculations about human qualities and their underlying
nature.

It should be realized also that the ‘map’ of the mind so far 
provided by factor analysis is very incomplete, although it repre-
sents a remarkable advance over what was known at the begin-
ning of the century. Factorial investigations normally require 
the application of at least a dozen tests (Americans prefer forty to 
fifty) to several hundred subjects, and the labour of calculating the 
correlations and extracting the factors is almost too great to be
done without mechanical aids. Moreover, the results are so affected 
by the particular tests used, especially when the battery is small, 
and by the background, sex, age, and other characteristics of the 
populations tested, that it is only by co-ordinating the findings of

cf. Vernon, P. E., and Parry, J. B., Personnel Selection in the British 
Forces (1949), Univ. of London Press. 
numerous analyses that reasonable certainty begins to emerge.
Finally, we shall see that different analysts often interpret the same
results differently, though the confusion to which this leads is more
apparent than real. 


Chapter Six 


TEST ‘RELIABILITY’ 


The term ‘reliability’ is used very loosely by psychometrists. 
Its meaning ranges from the comparatively narrow concept 
of consistency, through validity-plus-consistency, to some 
broad notion which embraces these two in addition to some 
vague ‘goodness’ reminiscent of the goodness of the most 
satisfactory Gestalten. The range is wide but the distribution 
is not symmetrical: more often than not, psychometrists who 
speak of the reliability of a test are referring to its consistency, 
but this again is not a uniquely descriptive term. The con- 
sistency of a psychological test may be assessed by several 
different methods and the result of each method answers a 
slightly different question. 

Measuring instruments of physical variables, such as rulers 
and thermometers, are highly reliable, given certain condi- 
tions—and a good deal is known about these conditions. 
The word ‘reliable’ as applied to such instruments is 
unambiguous. If we measure a table in summer or winter, 
rain or fine, with a tape-measure or an expanding ruler, in 
inches or in centimetres, we gain an answer which is subject
to relatively small variations only. If we wish to check the 
temperature at which water boils, we know that the height 
above (or depth below) sea-level at which we place our 
experimental kettle is important. Holding this and other 
relevant factors constant, again we obtain results which 
vary very little. 

In psychological testing, the stage has not yet been reached 
when most of the relevant factors are known; many of those 
suspected of being important are not controllable; and the 
whole field of mental measurement is intrinsically more 
complicated. It is hoped to indicate some of the reasons for 
this complexity and, by means of clarifying the terminology,
to clarify some of the problems. I propose to discuss first 
some of the methods of estimating the consistency of tests 
and secondly, the relation between test-consistency and 
test-validity. In the subsequent chapter, the various types
of unreliability associated with psychological testing will be
outlined.

It is evident that ‘consistency’ means ‘repeatability’ or 
‘reproducibility’, the extent to which the test, as a measure, 
agrees with itself. Expressed in this way, it is evident too 
that this is quite distinct from the test’s validity, that is, the 
extent to which the test agrees with some external criterion. 
Consistency is essentially internal to the test. This will become 
clearer in the discussion of the methods by which it is
estimated. 

The four most common of these methods are (a) test- 
retest, (b) item analysis, in some one of its various forms, 
(c) split halves, (d) parallel versions of the same test. 


(a) Test-retest 


With this method, the same group retakes the same 
test after an interval of time. The two sets of scores so
obtained are then correlated and the extent of the agree- 
ment is taken to indicate the consistency of the test. No set 
rules are laid down as to size of group, the length of time- 
interval or the minimum agreement between the testings, 
below which a test may be considered useless for practical 
purposes. It is, however, desirable that the group used for 
standardizing the test (standardization should cover con- 
sistency as well as norms) should be similar in composition 
to the type of group likely to take the test in the future. It is 
an asset if the two frequency distributions approach 
normality, partly because it is then that the product moment 
method of correlation, with its several advantages, is most 
efficient and partly because it is doubly difficult to interpret 
the role of improvement, between first and second testing, on a
skewed population. If the first distribution is symmetrical,
we may expect this sort of result from the two testings: [sketch of curves]
but if the first distribution is, for example: [sketch of curve] then a
second distribution of [sketch of curve] or even [sketch of curve] might result and
this might influence the apparent consistency of the test—for 
statistical rather than psychological reasons. It is impossible
for r to be unity if the two distributions are not identical in shape.
Equal absolute differences between two pairs of individuals
situated at different positions of the distribution curve do
not, of course, have equal significance even when the curve 
is symmetrical. But the differences in significance are liable 
to be harder to interpret, and are perhaps of greater 
magnitude, if the curve is skewed. 
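
The test-retest calculation described above may be sketched as follows. This is only an illustration: the function name `pearson_r` and the two sets of scores are hypothetical, and the product moment formula is the standard one rather than anything peculiar to this discussion.

```python
from math import sqrt


def pearson_r(x, y):
    """Product moment correlation between two equally long lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations from the two means.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each testing.
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)


# Hypothetical scores of the same eight subjects on two occasions.
first_testing = [12, 15, 19, 22, 24, 27, 30, 33]
second_testing = [14, 16, 22, 23, 28, 27, 33, 36]

test_retest_consistency = pearson_r(first_testing, second_testing)
print(round(test_retest_consistency, 3))
```

The coefficient summarizes agreement in relative standing only; it says nothing by itself about the size of group, the time interval, or the shape of the two distributions, all of which the text treats as open questions.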

There is some evidence that the longer the time interval 
between testings, the less the improvement of the group, as 
a whole, from first to second score. It is not known, however, 
whether variations in time interval affect subjects with unlike 
initial scores differentially. This point clearly has some bear-
ing on test-retest consistency and until the requisite data
are available it might be desirable to standardize the length 
of time between first and second testing, when estimating 
test-retest consistency. 

It may, in fact, be argued that test-retest is primarily a 
method for assessing differences in degree of improvement 
between subjects on test performance. But the psycho-
metrists using this technique have been interested in the
maintenance of rank order within the group rather than
in the absolute scores on the two testings and an improve-
ment which is uniform will not of course upset the ranking of
subjects at all. However, every member of the group does 
not always improve: a few gain a lower score on second test- 
ing and even fewer gain the same score both times. Just as 
this unpredictability tends generally to be greater on per- 
formance than on paper and pencil tests, and on personality 
than on cognitive tests, so it tends consistently to be greater
on certain intelligence tests than on others, suggesting that 
the notion of test consistency over a period of time has some 
value and that there is a sense in which tests of intelligence, 
at least, genuinely vary in this respect. 
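
The point about uniform improvement can be checked numerically. In the sketch below, which uses invented scores and a hypothetical helper `pearson_r`, a gain of the same size for every subject leaves the correlation at unity, whereas uneven changes (including one subject who scores lower on retest) reduce it.

```python
from math import sqrt


def pearson_r(x, y):
    """Product moment correlation between two equally long lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)


# Hypothetical first-testing scores for six subjects.
first = [10, 14, 18, 21, 25, 29]

# Every subject improves by the same three marks: rank order is untouched.
uniform_gain = [score + 3 for score in first]

# Uneven changes: some gain a lot, one stays put, one loses ground.
uneven_change = [13, 12, 24, 21, 26, 35]

r_uniform = pearson_r(first, uniform_gain)
r_uneven = pearson_r(first, uneven_change)
print(round(r_uniform, 3), round(r_uneven, 3))
```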

The main advantage of this method is the inclusion of a 
new dimension—one that is external to the test and inde- 
pendent of it—time. Allowing an interval (whether of a day, 
a week or a month) to elapse between successive testings 
automatically allows a good many other factors to enter into 
the test situation. It enables a tentative answer to be given 
to some of the questions which perplex the man in the street, 
the subject himself, and the psychologist. What part is 
played in test performance by the time of day (people’s 
diurnal rhythm varies)? By the day of the week (reactions to 
Monday, for instance, are not necessarily uniform)? By the 
season of the year (extremes of temperature often produce 
extreme and varying reactions)? How about the man who 
had just received his calling-up papers? And the girl with a 
headache? Surely the mental ability of people is not con- 
stant at all times and in all circumstances? And if not, surely 
mental test results to be of value should reflect these changes? 

The test-retest method provides a few data on these 
problems although psychometrists, when estimating the 
consistency of their tests, have naturally endeavoured to
hold constant as many of the testing conditions as possible. 
These endeavours have met with varying success and the 
results in general have shown a remarkable degree of agree-
ment between first and second testing, especially in paper
and pencil tests. Correlation coefficients of the
of knowledge has been built up, revealing which sorts of
psychological tests are most, and which least, likely to give 
similar results when repeated after a time interval. More- 
over this method yields more data than do either of the 
next two on the accuracy of the test as a measuring instru- 
ment. Methods (b) and (c) are rather like dogs chasing their 
own tails: (c) is perhaps the more dachshund-like of the two. 


(b) Item analysis 

This method is so unlike test-retest that it is hard to find 
any justification for applying the word ‘reliability’ to both. 
No time element enters into this method: in fact, nothing 
enters which is external to the test itself. The item analysis 
is useful for many purposes. These include ascertaining the
relative degree of difficulty of each question (a prerequisite 
to arranging them in order of difficulty), determining the 
cogency of individual questions (in so far as this can be done 
on an internal criterion) and, according to some psycho- 
metrists, ‘validating’ the test. Item analyses are sometimes 
used by psychometrists wishing to ‘purify’ their tests. Having 
attained already a high degree of purity by restricting 
themselves to one medium, one bias and one type of prin- 
ciple, they continue along the same lines, removing those 
individual test questions which fail to correlate very highly 
with total test score. This process is sometimes referred to— 
indefensibly in my view—as ‘validating the test’. However 
the present concern is with item analysis only as a measure 
of test consistency. 

In such an analysis, the score for each question in turn is 
compared with the total score on all questions in the test. 
The mean correlation found, in this series of comparisons, 
yields an estimate of the self-consistency of the test. This 
kind of item analysis, like most, is satisfactory only if the 
test has been given with unlimited time. If a time limit is
imposed, unless it is such a generous one that it defeats its 
own end, most of the subjects will not have attempted some
of the questions and the minority tackling the omitted, or 
the generally unreached, questions are liable to be unrepre- 
sentative of the group as a whole. In fact, they tend to form 
a sub-group which may well be selected in respect to the 
point at issue. Furthermore, the tacit equating of omissions 
with wrong responses which is entailed if any questions 
remain unattempted, is a rather dubious procedure. Ideally, 
the order of presentation of the questions, at this experi- 
mental stage, should be randomized for each subject. Usually 
the second best method is adopted, by which the questions 
are presented to everyone in the same order, subjects being 
asked to work through to the end systematically, without 
omissions.
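
A minimal sketch of this kind of item analysis is given below, assuming that every subject has attempted every question (so that no omissions need be equated with wrong responses). The response matrix is invented, and the names `pearson_r` and `item_total_correlations` are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Product moment correlation between two equally long lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)


def item_total_correlations(responses):
    """Correlate each question's score with the total score on all questions."""
    totals = [sum(row) for row in responses]
    n_items = len(responses[0])
    return [
        pearson_r([row[i] for row in responses], totals)
        for i in range(n_items)
    ]


# Hypothetical data: one row per subject, one column per question,
# 1 = right, 0 = wrong; all questions attempted by all subjects.
responses = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]

correlations = item_total_correlations(responses)
mean_correlation = sum(correlations) / len(correlations)
print([round(r, 2) for r in correlations], round(mean_correlation, 2))
```

A question answered identically by every subject has no variance and would make the correlation undefined; such questions would have to be set aside before the calculation.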

Whilst deprecating the practice of over-purification, it is
realized that item analysis—which enables the deviser of the 
test to recognize those individual questions which correlate 
non-significantly or even negatively with total score—may 
be very useful. The degree of heterogeneity (on this criterion)
to allow in a test is controversial. Other things being equal,
the more homogeneous a test, the greater its ‘reliability’, in
this particular sense: people who do well in one of its 
questions will tend also to do well in the others and those 
who find certain questions difficult will find others difficult 
too. 

This method has the practical advantage that the requisite 
data are collected at one sitting. They yield a good measure 
of the internal consistency of the test—a better measure than 
is yielded by application of Method (c).


(c) Split Halves 

In this method, as the name suggests, the test is given to
a group of subjects and is then split into two equal parts, and
the score on one part is compared with the score on the other 
part. Again unlimited time should be given to ensure that
all questions will have been attempted by all subjects and, 
again, the order of questions should either be identical for all 
subjects or be randomized. 

The fact that the length of the test is halved reduces the 
correlation, and a correction can be made for this by the 
Spearman-Brown formula. Comparing the two halves is 
treated as the equivalent of comparing two tests. (It is of 
course important that the two halves should be roughly the 
same level of difficulty.) They are clearly short tests and 
unless allowance is made for this, their intercorrelation will 
be automatically lowered. This assumption has been 
experimentally verified. 
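
The split-halves procedure and the correction for halved length may be sketched as follows. The data are invented, the split is by odd and even questions, and the correction shown is the usual Spearman-Brown form for a test of doubled length, r' = 2r / (1 + r); the function names are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Product moment correlation between two equally long lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)


def spearman_brown(half_test_r):
    """Estimate the full-length consistency from the half-test correlation."""
    return 2 * half_test_r / (1 + half_test_r)


# Hypothetical data: one row per subject, eight questions of similar kind,
# 1 = right, 0 = wrong; all questions attempted by all subjects.
responses = [
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 0, 1, 0, 0],
    [1, 1, 1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0],
]

# Questions 1, 3, 5, 7 form one half; questions 2, 4, 6, 8 the other.
odd_scores = [sum(row[0::2]) for row in responses]
even_scores = [sum(row[1::2]) for row in responses]

half_test_r = pearson_r(odd_scores, even_scores)
corrected_r = spearman_brown(half_test_r)
print(round(half_test_r, 3), round(corrected_r, 3))
```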

If all the questions in the test are similar in kind (e.g. are 
all diagrammatic analogies or all verbal directions), the split 
may be effected by taking alternate questions as they occur 
in the test and correlating the score on the odds with the
score on the evens. If, however, the test contains several
different kinds of question—arranged either cyclically or in 
batches—the task is more complicated. It cannot be assumed 
that Analogy 1, for instance, is perfectly matched in 
difficulty with Directions 1; nor should it be assumed that 
these differences in type and medium will even out with a 
large number of questions, especially as there is some evi-
dence that certain types of principle may be intrinsically
harder than certain other types. For these reasons, when
the test contains diverse material, the comparison should be 
made within types of question. 

It will be seen that this method is similar in many ways to 
Method (b): it might legitimately be described as a very 
crude variation of it. In neither method does the time 
element play any part; in both, what is obtained is purely 
internal to the test and sheds no light on its value apart from 


1 Gulliksen, H., Theory of Mental Tests (1950), pp. 65 ff. For a detailed
discussion on ‘Experimental Methods of Obtaining Test Reliability’,
see Chapter xv. Guilford, J. P., Fundamental Statistics in Psychology and
Education (1950), pp. 492 ff.
its evenness as a measure. There is no need to use the Split
Halves method if an Item Analysis has already been done. 
Both techniques yield data which are necessary but not 
sufficient for the production of a satisfactory intelligence test. 


(d) Parallel versions of the same test 

This method again answers quite different questions from 
those implicit in Method (a), and what it shares with 
Methods (b) and (c) is very limited. Here the psychometrist 
devises several ‘parallel’ tests, that is, tests which he tries to 
make as equivalent (in intellectual level, gradient of 
difficulty, etc.) as is consistent with their having no questions 
in common. Let us assume, in this instance, that he devises 
two such tests; he may of course produce three or more. He 
then gives both his tests to the same group of subjects and 
calculates the correlation between the scores on the two tests. 
(It is desirable that half his subjects should take Test X first 
and the other half, Test Y first, as a check on the equivalence 
of the tests.) Not only are differential practice effects impor- 
tant here: since the object of this technique is to determine 
the equivalence of the tests, identity of level and of distribu- 
tion of scores is as essential as identity of rank order of sub- 
jects within the group. 

Recent research suggests that a further factor, hitherto 
neglected, should be taken into account—the length of time 
required for the solution of each question. Cane and Horn, 
working with a series of tests of spatial perception, found that 
the difficulty of an individual problem should perhaps be 
defined in terms of the time spent on it no less than of the 
proportion of subjects who solve it correctly and that there 
is no straightforward connection between the rightness 
or wrongness of an answer and the time spent on it. 
Their results suggest that a pair of parallel tests devised 


1 Cane, V. R. and Horn, V., ‘The Timing of Responses to Spatial 
Perception Questions’ (1951), Quart. J. exp. Psychol., III, 3.
without considering these points may well meet before
infinity! 

It may be argued that knowing the extent of agreement 
between Test X and Test Y tells us nothing of the agreement 
of Test X with itself. On the other hand, it may also be 
claimed that the agreement of Test X with itself is unlikely 
to be less close than its agreement with any other test. Our 
attitude to this claim depends on our interpretation of the 
phrase ‘agreement with itself’: we have seen that this has at 
least two distinct meanings and that both of these are 
susceptible of differences in degree. 

Method (d) is now revealed as partaking something of the 
test-retest method and something of the other two. It is usual
to allow a time interval between giving Test X and giving 
Test Y. In this respect Method (d) resembles Method (a). 
On the other hand, if a low correlation is found between 
X and Y there is no way of telling whether this is due to 
inconsistency [of type (a) or (b)] in either or both tests, or to
lack of agreement between the two separate tests, or whether 
the low correlation owes something to several of these 
factors. 

Those who favour the method of parallel tests presumably 
think to kill two birds with one stone. It seems to me, how- 
ever, that they are more likely to find themselves with two 
stunned creatures both enjoying reasonably good hope of 
recovery. At its most modest, however, this method suggests 
to the test-deviser the minimum internal consistency that he 
is justified in assuming for each of his two tests. Moreover, 
parallel testing has an advantage over the test-retest method 
in that practice effects are liable to be far less important 
when the tests are similar than when they are identical. 


The question whether it is possible for a test (or any other 
measure) to show a closer correspondence with another test 
than it does with itself is an interesting one—which may be
answered differently by the psychologist and the statistician, 
respectively. It arises in practice over the problem of 
validity and consistency. Can a test with low consistency 
nevertheless have high validity? How justifiable is Tiffin’s 
suggestion that whilst in vocational guidance ‘there is no 
substitute for high validity’, in vocational selection, tests 
should be used which have high consistency, since their 
validity can always be raised by reducing the selection 
ratio?¹

It is mathematically demonstrable that the validity of a 
test against a fixed criterion can be equal to—but cannot 
exceed—the square root of the test-retest correlation 
coefficient.² (For purposes of this discussion, I shall ignore
the objections to using the terms ‘consistency’ and ‘validity’ 
as though they possessed some absolute, unalterable mean- 
ing when applied to any one particular test.) Thus, a test 
whose test-retest consistency was 0.49, for example, could
theoretically have a validity as high as 0.7. It would of
course often be lower than 0.7 but it is not logically impos-
sible for it to fall between 0.49 and 0.7. In practice the final
validity, if corrected for attenuation, may even exceed 0.7;
for this correction—with a consistency as low as 0.49—
would step up the original correlation considerably. 
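
The arithmetic of this paragraph may be set out as a short sketch. The ceiling is the square root of the test-retest coefficient, as stated above. The form of the correction for attenuation is not spelled out in the text; the sketch assumes the usual form, in which the observed validity is divided by the square root of the product of the test's and the criterion's consistency, and it treats the criterion as perfectly consistent. The observed validity of 0.6 is a hypothetical figure.

```python
from math import isclose, sqrt


def validity_ceiling(test_consistency):
    """Upper limit of validity against a fixed criterion."""
    return sqrt(test_consistency)


def correct_for_attenuation(observed_validity, test_consistency,
                            criterion_consistency=1.0):
    """Assumed standard correction: divide by the root of the consistencies."""
    return observed_validity / sqrt(test_consistency * criterion_consistency)


consistency = 0.49          # test-retest coefficient used in the text
observed_validity = 0.6     # hypothetical observed correlation with criterion

ceiling = validity_ceiling(consistency)
corrected = correct_for_attenuation(observed_validity, consistency)

print(round(ceiling, 2))     # the theoretical maximum for the observed validity
print(round(corrected, 3))   # the corrected figure, which exceeds that maximum
```

With so low a consistency the divisor is only 0.7, which is why the corrected figure is stepped up so sharply.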

It may sound odd to suggest that a test may agree more 
closely with some external criterion (usually a questionable
one) than it does with itself. However, those investigators
interested primarily in the people being tested, sometimes 
confuse the issue by arguing that what they are trying to 
measure is known to vary with time and place in a given 
individual, that the test score does so vary, and that this 
variation confers a cachet on the test. ‘It couldn’t be valid,’ 
they assert, ‘if it were completely consistent.’ 

It is surely true that people vary in the uniformity of their 


1 Tiffin, J., Industrial Psychology (1942). 
2 McNemar, Q., Psychological Statistics (1949), p. 136. 
reactions to similar (non-laboratory) situations, that some
vary in this respect more than others and that some situations 
are more likely to enhance this variability than others. It 
may reasonably be assumed, for example, that this ‘unreli- 
ability’ of the individual has played a big role in the repeated 
failures of ‘personality tests’ to yield satisfactory results: the 
role is probably greater in personality than in cognitive 
fields. But where the results are positive, that is, where certain 
tests (notably of intelligence) are found to have a relatively 
high test-retest consistency, this can to some extent be taken 
at its face value; and when a test aiming to assess similar 
qualities is found to yield low consistency it may be inferred 
that the inconsistency lies in the particular test rather than 
in the subject. 

In any case, all that our hypothetical psychologist could 
say about the highly ‘sensitive’ test would be that it is a 
means of assessing the subject on test performance, at that 
particular time: the subject varies and so his test score varies. 
It is evident that the claim made is very limited and quite 
unprofitable. Once a psychologist accepts marked incon- 
sistency of people in respect to the quality he wishes to assess, 
he is admitting the untestworthiness of that quality. 

The suggestion in Tiffin that for purposes of vocational 
selection, reliability may be a substitute for validity, whilst 
this is not true of vocational guidance,¹ seems to me to be a
shorthand statement of the difference in viewpoint between 
the ‘guider’ and the ‘selector’. The former aims to advise an 
individual towards a career which will yield him satisfaction 
and success: it would not help the adviser or his disappointed 
client, five years later, for example, to decide that most 
people attempting the job would have made a more drastic 
failure of it! The ‘selector’, on the other hand, aims to advise 
an organization how best to select its employees, in order to 
effect the least wastage. His data include, among other 


1 Tiffin, op. cit. (1942). 
things, the number of vacancies and the number of appli-
cants, and the extent to which the latter exceeds the former
determines the extent of the selection problem and the selec- 
tion ratio which Tiffin mentions. The selector may properly 
be described as a rejector (of a specified number) of the least 
promising candidates. Thus in the selector’s case, the 
satisfactoriness of the employees rather than their satis- 
faction is his object—although in practice the two tend, of 
course, to be related. (I shall not concern myself here with 
‘allocation’, which falls somewhere between guidance and 
selection in practice and in point of view.) 

To speak of ‘raising the validity’ of a test by ‘reducing the 
selection ratio’ is surely misleading on two grounds. It 
suggests that it is easy to play about with one’s selection ratio 
in order to arrive at the requisite degree of validity— 
whereas, in practice, the ratio of vacancies to applicants is 
usually predetermined and is rarely, if ever, under the 
jurisdiction of the tester. Secondly, the phrase suggests that 
‘validity’ is some one quality inherent in a test, susceptible of 
quantitative variation only—and controllable in this respect.
The objections to this view are discussed in the chapter on
test validation.

The term ‘reliability’ has been shown to be applied to a 
variety of methods and a variety of notions, having relatively
little in common. Each has its uses but to employ the same 
word for all these (and, occasionally, yet other) concepts 
seems a false economy. Since the practice has led to con- 
fusion, the banishment of the term ‘reliability’ is advocated 
and in its place the term ‘consistency’ is offered. The parti- 
cular method of determining consistency should always be 
specified as, for instance, ‘test-retest consistency’, ‘split halves 
consistency’. It is better to be cumbrous and clear than 
elegant and equivocal. 


Chapter Seven 


UNRELIABILITY IN MENTAL TESTING 


Since I have rejected the term ‘reliability’, it may seem 
curious to devote a chapter to a discussion of types of 
‘unreliability’. However, as I wish to evaluate some of the 
causes of inaccuracy in mental testing generally, restricting 
myself to factors internal to the test, the term ‘unreliability’ is 
perhaps permissible. I am not concerned here with any one 
particular kind of inconsistency: I am interested in the sorts 
of reasons which render mental tests less satisfactory methods 
of assessment than are the measuring instruments used in the 
physical sciences and in everyday life. In the list which 
follows, the classes are not, of course, mutually exclusive— 
and they probably do not, jointly, exhaust all the pos-
sibilities.


1. The test results may be unreliable because the test 
itself is a poor measuring instrument, in the sense that a fairly 
taut piece of elastic would be a poor instrument for measur- 
ing the length of, say, a bookcase. The analogy might even 
be extended, bearing in mind certain tests, to using a pair 
of scales, with weights of unknown value, for measuring the 
temperature of the bath water! 

Physical measuring instruments, even when employed 
with appropriate media, are sometimes fallible. They may 
be too crude or too fine, or relevant material conditions may 
have been ignored; moreover they are usually subject to 
observer error. But their accuracy is nevertheless of quite a 
different order from that of mental tests; it is different in 
kind as well as in degree. If, in the case of the bookcase, for 
example, the measurer used an appropriate instrument such 
as a ruler calibrated in inches and tenths of an inch, he 
would be unlikely to find a discrepancy of more than, say,
two tenths if he measured the bookcase at a later date or
if somebody else measured it with another ruler or even with
a tape-measure. 

This relatively high degree of accuracy is due in part to 
the fact that there is no doubt about the existence and nature 
of what is being assessed, and in part to the justified assump- 
tion that the units of measurement—in this case, inches— 
are equivalent. In mental measurement, so called, the 
assumption again is made that the units—in this case, marks 
on test questions—are equivalent but here its justification is 
exceedingly doubtful. 

In most intelligence tests the score consists of the sum of 
the right responses: if the subject has correctly answered 30 
out of the 40 questions comprising the test, his score will be 
30. An unweighted score reached in this way implies that
the questions in the test are equivalent, in nature and in level 
of difficulty. Psychologists are in fact well aware that the 
questions in most intelligence tests are not equivalent since 
they take steps to arrange the questions in ascending level 
of difficulty, with the object of allowing for differences 
between their subjects’ speed of working¹ and for practice
effects within the test. The fact that the test-deviser aims at 
this gradient of difficulty, plus the fact that such a gradient, 
objectively measurable and uniform for all groups, is virtually 
impossible to achieve (owing, precisely, to practice effects 
within the test), throws some doubt on the notion of the 
equivalence of test questions, on which most scoring of tests 
is implicitly based. 

There are, of course, other, more complicated, weighted 
systems of marking in which for example, wrong answers are 
penalized though omissions are not, and the solving of a 
‘difficult’ problem is more generously rewarded than the 
solving of an ‘easy’ one. But these systems are usually highly 

1 See Chapter IX, p. 114.
arbitrary and are open to at least as many objections as is the
simpler system of marking. 
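
The difference between the two systems of marking can be illustrated with a small sketch. Everything in it is hypothetical: the six questions, the weights attached to them, and the penalty of one mark for a wrong answer are chosen arbitrarily, which is itself the objection raised above.

```python
def unweighted_score(answers):
    """Sum of right responses; every question counts as one mark."""
    return sum(1 for answer in answers if answer == "right")


def weighted_score(answers, weights, wrong_penalty):
    """Reward harder questions more, penalize wrong answers, ignore omissions."""
    total = 0
    for answer, weight in zip(answers, weights):
        if answer == "right":
            total += weight
        elif answer == "wrong":
            total -= wrong_penalty
        # An omitted question neither adds to nor subtracts from the score.
    return total


# Hypothetical weights for six questions in ascending order of difficulty.
weights = [1, 1, 2, 2, 3, 3]

# Two hypothetical subjects, each with four right responses.
subject_a = ["right", "right", "right", "right", "omitted", "omitted"]
subject_b = ["right", "wrong", "right", "omitted", "right", "right"]

print(unweighted_score(subject_a), unweighted_score(subject_b))
print(weighted_score(subject_a, weights, 1), weighted_score(subject_b, weights, 1))
```

The two subjects are indistinguishable on the simple count of right responses but are separated by the weighted system, and a different choice of weights or penalty could separate them differently.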

The non-equivalence of test questions is not the only cause 
of the unsatisfactoriness of mental tests as measuring instru- 
ments. All the difficulties inherent in gauging physical 
attributes apply to mental measuring but they tend to be 
less easily recognized. The ruler appropriate for estimating 
the length of the bookcase would be unsuitable for measuring 
the diameter of a thin steel rod, and even more unsuitable 
for directly measuring its circumference. What is the mental 
analogy here?—perhaps using a test which discriminates 
satisfactorily between members of a random sample to 
differentiate between members of a highly selected, intelli- 
gent group; or, perhaps, using a test without preliminary 
examples to assess members of a group who are known to vary
widely in their previous test experience. 

2. A psychological test may be unreliable because the 
quality to be assessed (if indeed such quality exist) is an 
intrinsically variable entity, such as body temperature, for 
instance. It is known that the latter varies with age and 
with time of day and, moreover, that the extent of this 
variation varies with the individual. If intelligence (or 
memory or persistence . . .) varies in the degree to which it 
manifests itself within the individual, then however con- 
sistent, appropriate and sensitive the measuring instrument, 
it will not yield the same reading on successive occasions. 
In fact, the more sensitive the instrument, the more liable 
will it be to yield fluctuating results.

It is probable that this particular source of unreliability 
plays a bigger role in tests of temperament than in cognitive 
tests. No one without psychological training is likely to 
assume that a man who reacts courageously to one type of 
situation (say, when piloting a damaged aircraft) will 
necessarily tend to react courageously to another type of 
situation (say, when required to make a speech to a large 
body of people) and yet again to an admittedly artificial
test situation. 

The evidence on this point—whether experimental, 
observational or introspective—suggests that intrinsic vari-
ability is less in cognitive matters generally and least in matters
of intelligence, specifically. Nevertheless the possibility of 
its influencing intelligence test results should not be denied: 
particularly dangerous is the denial when it is followed, 
logically enough, by the inference that there is a ‘true score’ 
representing the individual’s ‘true intelligence’ and that any 
recognized unreliability must therefore be corrected for. 

3. Unreliability of test results may be due to circumstantial
effects. Under this heading I wish to subsume all those 
phenomena which may influence members of the group, 
wittingly or unwittingly, at the time of taking the test. These 
will include circumstances immediately preceding the test- 
ing, contemporaneous with it and perhaps, since coming 
events cast their shadows, circumstances immediately follow-
ing the testing. These circumstantial effects may be classi-
fied, for convenience, into three categories—which overlap
considerably: (a) circumstances known or easily ascertain- 
able by the tester and the subject, (b) circumstances unknown 
to the tester but known to the subject, (c) circumstances 
unknown at the time of testing to both tester and subject. 

Category (a) is the least controversial and the easiest on 
which to gain convincing evidence. It would include such 
phenomena as the weather, the international situation, the 
technique of the tester, the time of day at which the test is
taken. Circumstances of type (a) are in operation equally 
for all the subjects tested together, but their effects on different 
members of the group may be far from identical. For 
example, a tester with poor technique may succeed in 
reducing the score of his group as a whole: he conveys an 
impression of contempt or lack of interest, he hurries through 
the instructions too fast to be clear and he does not think to 
switch on the light although the daylight in the test-room is
inadequate. Motivation generally will be poor with these 
subjects, many of whom will be too confused and nervous 
to do themselves justice. All will be liable to be affected by 
the tester’s manner but some will be more affected than 
others. On the other hand with circumstances such as 
rainy weather, it may be relatively few who are affected at 
all—perhaps the cricket blue who has a fixture in the after- 
noon and the market gardener tested after a period of 
drought. 

This brings us immediately to (b), circumstances unknown 
to the tester but known to the subject. In the two instances 
just given, the unreliability of the test was due to a combina- 
tion of type (a) with type (b). At some other time, rainy 
weather might not have affected the scores of the cricket 
blue and the market gardener and, equally, their absorbing 
interests might not have affected the scores had the weather 
been fine. The typical (b) type of circumstance, however, 
plays its part in influencing the score of an individual, what- 
ever the particular conditions at the time of testing. 
Examples of these would be a hangover, rise in wages, 
resentment at being tested. The subject may have such 
circumstances more or less in mind and, according to their 
vividness and to the interest which the test holds for him, 
they will be more or less liable to influence his performance 
on the test. 

It will be observed that my illustrations include both 
phenomena which are likely to raise the spirits of the subject 
—with, perhaps, consequent increase in energy, generally— 
and also those likely to depress him, mentally or physically. 
It is noteworthy that subjects, when afforded the opportu- 
nity, quite often confide to the tester that the test conditions 
operative in their particular case were singularly unpro- 
pitious—with or without the explicit inference that the 
subject would have done much better in other, happier 
circumstances. The contrary information—that the subject
feels that he has done himself more than justice—has, to my 
knowledge, never been disclosed (if I except the occasional 
subjects who volunteer the information that they have had a 
good deal of experience with psychological tests). The 
subject who receives an hour before the test a welcome 
letter from his fiancée, announcing her imminent visit, is as
likely to claim that his consequent excitement ‘put him off’
as is the subject who has just had a disastrous quarrel with 
his landlady. 

However, this tendency to produce evidence that one 
‘ought to have’ gained a different test score—and always a 
better one—should not be received with mere smiling 
cynicism. There is an important sense in which a subject is 
always at least as good as his test performance, and possibly a 
good deal better. For a great many reasons, an individual 
may fail to do himself justice on a psychological test, that is, 
he may produce a score at variance with his ability as 
assessed on other grounds. If the discrepancy is positive—if 
he appears to do ‘unduly well’—the test will have provided 
interesting and instructive psychological data. But if the 
discrepancy is negative—and he does surprisingly badly in 
the test—then the interest should lie in the interpretation of 
this discrepancy rather than in the test score, as such. 

Type (c) is probably the most frequent producer of 
discrepancies of this kind. Under (c) I include those many 
circumstances, relevant to test performance, which are 
nevertheless unrecognized both by the subject and the tester 
at the time. Such circumstances are not necessarily hypo- 
thetical: their influence can sometimes be observed, and 
their existence inferred, some time after the testing. An 
uncontroversial example would be the cold or the illness 
which developed some days after the testing and whose 
onset was just below the threshold of introspection by the 
subject at the time of taking the test. Most of us are familiar 
with the experience: ‘so that’s why I felt so lethargic (or
behaved so stupidly)’, etc. accompanied by a vague feeling 
of relief and of awareness that now, for the first time, the 
lethargy (or stupidity) has entered consciousness. 

4. Effects of practice. The effects of practice in increasing 
test unreliability may be regarded as a special case of 
‘circumstantial effects’. I shall consider them separately, 
however, partly because they are such a very special case 
and partly because they play a particularly important role 
in the theory and practice of mental testing. 

The main agreement among research workers in this field 
has been on the striking improvement in test performance 
shown by practised subjects, irrespective of the tests and the 
groups used, of the length of interval between testings and of 
the method of computing increase in test score. Research 
workers have in fact shown a commendable catholicity in 
these matters—which renders difficult any reconciliation of 
their more specific results. 

However, I am here concerned primarily with differential 
improvement as between members of a group or as between 
groups of different mental levels. Had the various investi- 
gators agreed on this point, and agreed that the speed and 
extent of improvement were uniform for all subjects through-
out all retestings, practice effects would not make a material
contribution to the unreliability of psychological tests. 
Neither of these postulates is true, however. 

There is some evidence that, given equal opportunity, 
more intelligent groups tend to improve with practice to a 
greater degree than less intelligent groups and that the more 
intelligent members of a relatively homogeneous group tend 
to improve to a greater degree than their fellow subjects. 
(For the sake of brevity, ‘intelligent’ here denotes ‘initially 
gaining high score on intelligence tests’.) But the oppor- 
tunity is rarely ‘equal’. The tendency has been to use for all 
groups intelligence tests originally devised for use with
random samples of the population. Such tests are too easy
(or too much tests of speed) for subjects who are both intelli- 
gent and practised to be able to increase their test score 
substantially, after a few testings: they are tackling most or 
all of the test questions and answering most of them 
correctly. The initially average and poor scorers have far 
more scope for improvement. They have more new prob-
lems to tackle and more errors to correct. They have in
fact been found in several inquiries to improve to a greater 
extent than the initially high scorers. In such cases, the 
range of test scores naturally narrows—all subjects in due 
course approaching the possible maximum score—and 
differentiation and consistency fall. 
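The narrowing of range described above can be illustrated with a small sketch. The scores, the maximum of 60 and the uniform potential gain of 15 points are all invented for the illustration; nothing here is taken from the inquiries referred to in the text.

```python
MAX_SCORE = 60  # hypothetical maximum possible score on the test


def retest(scores, gain, max_score=MAX_SCORE):
    """Add the same potential gain to every score, never exceeding the maximum."""
    return [min(s + gain, max_score) for s in scores]


def score_range(scores):
    """Difference between the highest and lowest score in the group."""
    return max(scores) - min(scores)


first = [20, 35, 50, 55]         # invented first-testing scores
second = retest(first, gain=15)  # invented uniform potential gain

print(second)               # [35, 50, 60, 60]
print(score_range(first))   # 35
print(score_range(second))  # 25
```

The two initially high scorers are held at the ceiling, so the low scorers appear to improve more and the group can no longer be differentiated at the top.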

In order to afford full and equal opportunity to all subjects, 
a test that is sufficiently difficult and sufficiently long must 
be employed with groups which include highly intelligent 
members. If such a test is not available, groups of duller 
subjects may be used. But neglect of the relation between 
the intellectual level of the group and of the test is liable to 
produce misleading results in experiments on practice 
effects (or on any other problem in mental testing).

It would appear that whatever this relation, differential
improvement is the rule rather than the exception; and
whilst some degree of test-familiarity is also becoming the
rule rather than the exception, the degree is by no means 
uniform for all members of all test groups. The test- 
experience of subjects varies qualitatively as well as quanti- 
tatively: some have met the same test on several occasions 
and are now confronted with it yet again, others who have 
met only one test before find that it is a different one from 
the present test. Again, some subjects have had ‘knowledge 
of results’ during or after testing and others have not; some 
have had systematic training in the technique of taking 
intelligence tests and others have been to schools whose 
principals have principles. It is possible that in certain 
conditions—the test being just sufficiently similar and
sufficiently different—‘negative transfer’ might occur, the
more test-sophisticated subjects finding themselves at a 
disadvantage. There is some evidence, too, that the con- 
ditions in which the first of a pair or a series of tasks is 
carried out may affect the performance of subjects not 
merely on that first occasion but on subsequent occasions 
when the subjects are confronted with a similar or identical 
task.1

To many of these problems the answer is at this stage ‘more 
research’. But available data make it clear that previous 
test-experience should not be ignored, when attempting to 
estimate the inherent unreliability of a test. 

5. Observer error. Under this heading I wish to include all 
the ‘unreliability’ of a test which can be traced to errors in 
testing technique or in scoring. The former type of error 
might arise for instance with a witless but keen subject who 
cheats, undetected and fruitfully, by copying from his 
neighbour’s answer sheet on first testing and who has the 
misfortune on second testing to be placed next to a subject 
who proves useless from this point of view. His two scores 
are likely to be widely discrepant and to contribute largely 
to the ‘unreliability’ of the test.

It may be argued that cheating is unlikely to occur with a 
competent tester in charge and that given an incompetent 
tester, innumerable and obvious grounds of unreliability are 
at once apparent. Competence in testing is, however, a 
matter of degree—even, to some extent, a matter of individual
judgment and taste—and many of the errors which may
arise primarily as a result of testing technique would, if the
technique were amended, very likely be superseded by
others. For example, the tester who is on the lookout for
cheating and who therefore wanders round the test-room
gazing now at this subject, now at that, is likely to increase
the nervousness of his more susceptible subjects very con-
siderably—with results as deleterious to the ‘reliability’ of the
test as to the well-being of the subject. Testing technique is
always essentially a matter of compromise.

1 Welford, A. T., Brown, R. A. and Gabb, J. E., ‘Two Experiments
on Fatigue as Affecting Skilled Performance in Civilian Aircrew’ (1950),
Brit. J. Psychol., XL, 4.

I should include under errors arising from testing 
technique the score gained by a subject who has not fully 
taken in the instructions or accepted the conventions operat- 
ing for the particular test. Such a subject may, for instance, 
not turn over to page 2 when he has reached the end of 
page 1, but may sit waiting to be told. This is not always 
observed by the tester in charge of big groups; and, when 
marking, it is of course impossible to determine whether the 
subject was a very slow one who happened to have answered 
the last question on page 1 just as the time limit was reached 
or whether he had awaited specific instructions to turn over. 
If the marker finds many papers in which the answers come 
to an end exactly at the bottom of one page, he will be liable 
to suspect the tester on this count: in this event he will still 
have little idea of what the ‘true’ score should be for these 
subjects. 

This type of misapprehension on the part of the subject 
does not necessarily imply a lack of intelligence. It is just as 
likely to be a continuing result of indispensable instructions 
given early in the session: ‘Do not turn over until you are 
told to do so’ or‘... until everybody is ready’, etc. Counter- 
manding these instructions later, when embarking on the 
test proper, does not always register with every subject.

This instance illustrates the extent to which testing 
technique and scoring technique overlap. At one extreme 
we have straightforward inaccuracy on the part of the 
scorer. He marks an answer right when it is wrong, or
conversely; he adds up one of the columns, or subtracts the 
number of omissions, incorrectly; or he may successfully 
master all the preliminaries and make a mistake in his final
addition (or weighting) of the several parts of the test, or in 
transforming the crude score into a percentile or grade. 
Research on this matter, and personal experience, point the 
same way—towards the necessity of having all scoring 
checked, if possible by a second marker. 

More important, because less easily checked, are the errors 
which arise essentially from having designed the test to 
be scorable objectively and quickly. Let us suppose, for 
example, that the test is of the multiple choice kind (as 
current group intelligence tests nearly always are) and that 
for each problem six possible solutions are offered, of which 
only one is correct. These solutions may be numbered 1-6 
or lettered A-F, or any one of a number of other devices may 
be used. Let us suppose that the correct responses to question 
numbers 24-34 are as shown in Table A, column (i). 

For the purposes of this illustration, the subject (who turns 
out to be highly intelligent in other respects) confuses his 
D answer to question 24 with his D answer to question 25. 
In fact, without realizing it he economizes on D’s and makes 
one do the job of two. As a result he inadvertently enters 
his answer to no. 26 in the space provided for 25—and so on 
to the bottom of the page, giving us column (ii). (It may be 
observed that he made one mistake, when answering 
question 30.) 


TABLE A

Question no.    (i)    (ii)

24              D      D
25              D      A
26              A      C
27              C      E
28              E      B
29              B      A
30              D      B
31              B      F
32              F      F
33              F      E
34              E      (blank)




After this, one of several things may happen. The subject 
may realize the significance of the empty space at the bottom 
of the page opposite question 34, and may worry about it. 
In this case, he will probably laboriously alter his last ten 
answers, losing valuable time and substantially reducing 
his test score (and augmenting the unreliability of the test). 
Or he may not notice or may not worry, in which case 
the tester will have the task of scoring column (ii) as it 
stands. 

Here again, there are several contingencies. If the scorer 
has some interest in the test or the subjects and meets this 
particular answer sheet early in the marking, the process 
may not yet have become automatic. If, moreover, the sub-
ject in question had given the correct response to most of the
problems numbered 1-23, it might strike even the bored
scorer as odd that the subject should suddenly go to pieces
and produce eight wrong answers and one omission on
eleven questions. So the scorer may examine the answer 
sheet more closely and deduce that the ‘true’ score should be 
nine out of eleven (or conceivably ten, if he decides to allow 
the double D which originally caused the trouble). But if 
the subject has not done well in the early test questions, our 
conscientious and discerning scorer will find himself in a
quandary. Should he accept the interpretation of column
(ii) which awards the subject nine marks, and explain the
change in behaviour between page 1 and page 2 as ‘adapta-
tion’ or ‘practice effects’? Or should he regard the finding as
a striking coincidence, dismiss his idea as unproven and
subjective, and award two marks only (for questions 24 and
32)? He knows that whichever course he choose, he must 
treat all other similarly suspect answer sheets in the same 
way—even those of manifestly intelligent subjects, such as 
the one we discussed above. 

The sensible decision may appear obvious in the relatively 
clear-cut instance I have selected. But how determine the 
score when adjustment of the marking key would produce
seven plus answers to four minus, or six plus answers to five 
minus, instead of nine to two? And, if no all-or-none ruling 
be made, how decide where to draw the line? 
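The two ways of marking the answer sheet discussed above can be set out as a short sketch. The key and the entries follow Table A as read here; the cells for questions 31 to 33 are legible only in part and have been filled in so as to agree with the counts given in the text (two marks as the sheet stands, nine on the shifted reading, ten if the shared D is allowed). The function names and the `skipped` parameter are invented for the illustration.

```python
# Answer key, column (i), and the entries as written, column (ii).
# None marks the empty space opposite question 34.
KEY = {24: "D", 25: "D", 26: "A", 27: "C", 28: "E", 29: "B",
       30: "D", 31: "B", 32: "F", 33: "F", 34: "E"}
ENTERED = {24: "D", 25: "A", 26: "C", 27: "E", 28: "B", 29: "A",
           30: "B", 31: "F", 32: "F", 33: "E", 34: None}


def score_as_entered(key, entered):
    """Mark each entry against the key in the row where it was written."""
    return sum(1 for q in key if entered.get(q) == key[q])


def score_with_slip(key, entered, skipped, allow_shared=False):
    """Mark on the hypothesis that the answer to `skipped` was never entered,
    so that every later entry sits one row too high."""
    marks = 0
    for q in key:
        if q < skipped:
            answer = entered.get(q)
        elif q == skipped:
            # The subject let one entry serve for two questions.
            answer = entered.get(q - 1) if allow_shared else None
        else:
            answer = entered.get(q - 1)
        if answer == key[q]:
            marks += 1
    return marks


print(score_as_entered(KEY, ENTERED))                                # 2
print(score_with_slip(KEY, ENTERED, skipped=25))                     # 9
print(score_with_slip(KEY, ENTERED, skipped=25, allow_shared=True))  # 10
```

The sketch settles nothing about which figure ought to be awarded. It only shows that the same sheet yields two, nine or ten marks according to a hypothesis about the subject's behaviour which the key itself cannot test.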

A Hollerith machine will not of course be subject to these 
doubts and fears: it will score more rapidly (for this and 
other reasons) but, as we have seen, not necessarily more 
accurately. The answer sheet with which it is presented will 
naturally be in a somewhat different form from that under 
discussion but it is none the less possible for the same sort of 
error to affect the scoring. Methods of recording answers 
and methods of scoring are becoming more and more 
ingenious and swift and objective. But even if errors on the 
part of the scorer or the scoring apparatus be reduced to a 
minimum, the subject cannot be trusted to eliminate such 
errors. In a sense the further removed from human hand 
and eye the scoring, the more liable will it be to ‘observer 
error’ of the kinds I have instanced. Again, it should be 
stressed that there is no evidence indicating that the subjects 
liable to such errors are unintelligent on other criteria—a 
claim which, in the last resort, is sometimes made. These 
few illustrations of types of ‘observer error’ may have 
suggested something of the many types which exist, of the 
interdependence of tester and scorer, of some of the advan- 
tages and drawbacks of mechanical scoring devices. 


Each of the five types of unreliability outlined may be 
found alone, or, more often, in conjunction with one or more 
of the others. It is likely that most of them play some role 
in determining the unreliability which is found to a greater 
or lesser degree in all tests and which psychometrists seek to 
minimize by every means within their power. I have re-
stricted myself to the more valid grounds of unreliability, that
is, those which are inherent in the whole notion of intelligence
testing and which cannot be eliminated by improved tests.
Some of them might be reduced by ‘improved’ selection of 
groups but this would constitute a sophisticated escape from 
the problem rather than a solution. It would appear then 
that a certain quota of unreliability in mental tests is in-
evitable. Thus, whilst every effort should be made, in
devising tests and administering them, to keep this quota 
as low as possible, it should be recognized and investigated 
rather than ‘corrected for’. 

The practice of correcting validity coefficients for attenua- 
tion is based on the assumption that there is such a thing as 
a ‘true score’. The argument in non-statistical terms runs as 
follows: this test has low test-retest consistency. It does not 
correlate as highly with the external criterion as it would if 
its test-retest correlation were unity. This correlation would 
be unity if we had the true score. We therefore ascertain 
what the validity correlation would be if the test-retest 
correlation were unity. 
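For readers who want the arithmetic behind this argument, the usual textbook form of the correction divides the obtained validity coefficient by the square root of the product of the two reliabilities. The sketch below assumes that standard form. The function name and the figures are invented, and setting the criterion's reliability to 1.0 reproduces the case described above, in which only the test is corrected.

```python
import math


def correct_for_attenuation(r_xy, r_xx, r_yy=1.0):
    """Return the validity coefficient 'corrected' for unreliability.

    r_xy: obtained correlation between test and criterion
    r_xx: reliability (e.g. test-retest correlation) of the test
    r_yy: reliability of the criterion; 1.0 leaves the criterion uncorrected
    """
    if not (0 < r_xx <= 1 and 0 < r_yy <= 1):
        raise ValueError("reliabilities must lie in the interval (0, 1]")
    return r_xy / math.sqrt(r_xx * r_yy)


# Invented figures: obtained validity 0.60, test-retest correlation 0.64.
print(round(correct_for_attenuation(0.60, 0.64), 2))        # 0.75

# Nothing in the arithmetic keeps the result at or below unity.
print(round(correct_for_attenuation(0.70, 0.40, 0.50), 2))  # 1.57
```

The lower the reliability found, the larger the upward adjustment, which is why the assumptions behind the procedure matter more than the computation itself.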

I am not concerned here with the science (or ethics) of 
determining how much higher a validity correlation could 
or should be than it in fact is. It seems to me of doubtful 
theoretical value and negative practical value. But I should 
like to discuss some of the implications of the phrase ‘true 
score’. To assert that a given test score ‘is not a true score’
suggests:

(a) That something is being tested—some attribute that is 
real and specific. 

(b) That this attribute is measurable, susceptible of
quantitative variation only, and capable therefore of being
expressed on a linear scale.

(c) That the test in question is the right measuring 
instrument for this attribute or so nearly the right one that 
it is justified to make the relevant computations as though 
the instrument were wholly right. It is the fallibility of the
instrument which allows the psychometrist to infer a higher 
validity than has been found. 
(d) That the ‘unreliability’ found in the test is due to some
defect in the test itself. 

I have queried premises (a) and (b) elsewhere. Premises 
(c) and (d) seem to me question-begging. We are not justi- 
fied in assuming the appropriateness of the measuring 
instrument or the intrinsic consistency of that which is to be 
measured. 

The ‘true score’ equals the obtained score minus the 
variable error (which may be positive or negative); and the 
sources of variable error may be disposed of under the 
headings ‘physiological’, ‘psychological’, ‘scoring’, etc. 
They cannot, however, be admitted as the integral part of 
the test situation which they are, just as constant errors are. 
A constant error is one which affects all the measurements 
in the same direction. Such circumstances as a poorly 
mimeographed test paper or an incompetent tester would 
be sources of constant error. The distinction is a purely 
statistical one. Its artificiality is brought out by the fact 
that the same circumstance, for instance subject A’s head- 
ache, would be a constant error on the split halves method 
and a variable error on the test-retest method of estimat- 
ing reliability. In the former case r will be raised; in the 
latter case it will be attenuated. Even the poorly mimeo- 
graphed test paper, though liable to affect adversely all 
members of the group, will very likely upset some subjects 
more than others. 

The criterion for ‘true score’ is usually high test-retest 
consistency: as the test-retest correlation approaches unity, 
the score approaches ‘truth’. The psychometrist believing 
this and, presumably, the above four premises, should be 
surprised at the large number of tests which do not approach 
a ‘true score’. What is surprising is that the test-retest 
correlation coefficients of certain tests in certain situations 
are as high as they sometimes are. It is clear that the extent 
to which a test produces a ‘true score’ varies with the 
population tested: a very heterogeneous group tends to
yield a very high reliability—so high that the psychometrist 
may be tempted to believe that he is close to ‘true reliability’. 
Naturally a very different result will be found when the 
same test is given to a highly selected group. The test-retest 
consistency will fall; the score on the same test will no longer 
be the ‘true score’. 
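The dependence of test-retest consistency on the make-up of the group can be shown with a small simulation. It is a sketch only, and it deliberately adopts the very model questioned in this chapter (a fixed level per subject plus independent chance error on each occasion) in order to show that, even on the psychometrist's own terms, the coefficient changes when the group is selected. The spreads of 15 and 5 points and the cut-off of 120 are invented.

```python
import math
import random


def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)  # fixed seed so the run is repeatable
N = 5000

# Assumed model: a fixed level per subject plus chance error on each occasion.
level = [rng.gauss(100, 15) for _ in range(N)]
first = [v + rng.gauss(0, 5) for v in level]
second = [v + rng.gauss(0, 5) for v in level]

# Heterogeneous group: everybody tested.
r_whole = pearson_r(first, second)

# Highly selected group: only those scoring 120 or more on first testing.
selected = [(a, b) for a, b in zip(first, second) if a >= 120]
r_selected = pearson_r([a for a, _ in selected], [b for _, b in selected])

print(f"whole group:    r = {r_whole:.2f}")
print(f"selected group: r = {r_selected:.2f} (n = {len(selected)})")
```

The test and the error are identical in the two computations; only the range of the group differs, and the coefficient falls accordingly.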

The correction for attenuation is statistically permissible 
only when all the assumptions required for a sound relia- 
bility coefficient have been met. One of the main assump- 
tions is that the relevant error be a chance error and not such 
that it will lead to a spuriously high r. These assumptions 
are discussed by McNemar in a chapter on ‘Factors which 
affect the correlation coefficient’. As he points out,
‘corrected r’s greatly in excess of unity have been reported’. 
In such cases the absurdity of correcting for attenuation, and 
the failure to have met the necessary assumptions, are 
obvious. I wish to suggest, however, that the correction is 
never defensible on psychological grounds and that its use 
reinforces some of the more misleading beliefs associated 
with mental testing.

The unreliability of mental tests should, I think, be mini- 
mized by careful attention to the devising of the tests, the 
technique of the tester and the constitution of the group. At 
present, the constitution of the group is itself sometimes used 
as a further opportunity for boosting the coefficient of
validity. In this case the process is known as ‘correcting for 
homogeneity’. The argument briefly is that the test scores 
available for comparison with the criterion have been 
obtained from a highly selected group—or, at least, a 
group that is more highly selected than that which will 
produce the scores when the test is used in practice. There- 
fore, it is claimed, these subsequent and non-experimental 
scores will yield a wider range and, hence, a higher correla-
tion with the criterion.

This argument may be permissible in a few special in- 
stances. In general, however, it is wiser to gain validation 
data from groups similar to those with which the test will 
ultimately be used. Moreover, the assumption is not always 
justified that the inclusion of far duller (and/or brighter) 
subjects in the group will, while extending the range, 
necessarily preserve the linear relation between test scores 
and criterion. 

It seems to me that intelligence tests are bound to be 
unreliable in some ways, however carefully they are con- 
structed and administered. But they are almost certainly no 
less reliable than other methods of assessing human capacity 
nor are they lacking in value as instruments of research and 
of individual diagnosis. Once the unreliability of any one 
test has been established, however, it should not be used— 
if low—as an argument for stepping up the test’s validity.

It will be seen, in the following chapter, that many of the 
criteria against which tests are validated are themselves 
unreliable. 


Chapter Eight 


VALIDATING INTELLIGENCE TESTS 


I propose to avoid the phrase ‘test validity’ and to speak 
always of validating a test, or of the method of test validation 
used, in any one particular situation. This choice is perhaps 
analogous to the choice of the term ‘remembering’ in pre-
ference to the term ‘memory’—and for analogous reasons.
Use of the word ‘memory’ suggests that there is some one
entity or faculty or factor inherent in people, possibly in
varying degrees and that, in some important sense, it exists in
vacuo, static and measurable (or, at least, indirectly infer-
able). Equally, [...] tests, carries with [...] quality which a
[...] That I do not believe in any such simple quality follows
[...] simple qualities of the mind (such as ‘intelligence’,
‘memory’, ‘persistence’). The phrase ‘validity [...] suggests
that the exact meaning of [...] that a satisfactory criterion of
it [...] straightforward matter to compare test X [...] some
one point of time—and that this [...] implying a process [...]
will in itself take [...] test X?’ or ‘How far does test X
genuinely assess intelligence?’ we should ask ‘How has test X
been validated?’ And the ‘how’ constitutes a request for
information as to the criterion, the size and constitution of
the group and the method of comparison.

There are then three grounds for stressing the active rather 
than the substantive aspect of validation. First, I doubt that 
the concept of ‘validity’ has value as a simple entity, 
susceptible to quantitative variations only—since I doubt 
the existence of a corresponding ‘intelligence’. Secondly, I 
believe that validation depends largely on the relation 
between the test and the particular group tested. For 
example, a test which has been convincingly validated on, 
say, a population of army entrants is unlikely to prove 
equally valid when applied to a group of university students. 
Thirdly, the estimation of validity presupposes a satisfactory 
independent criterion with which to compare the test—and 
this is rarely, if ever, available.

A satisfactory criterion is of course a prerequisite to any 
sort of validation and finding one is perhaps the most intract- 
able problem in the field of intelligence testing. The diffi- 
culty has been turned to advantage by some mental testers 
who, finding a correlation coefficient of less than 1.0 between
their new test and their criterion (which has always been 
known not to be a strict measure of intelligence) infer that 
the residual variance is a measure of how much better the 
test is than the original criterion, at gauging intelligence. 
Others ignore the difficulty by resolutely not seeking any 
measure independent of the test—‘validating’, instead, each 
individual question against the internal criterion of total 
test score. They then ‘increase the validity’ of their test by 
altering or removing those items which failed to yield a 
sufficiently close association. This procedure is innocuous, 
provided that it is not termed validation and that no 
inference is drawn as to the practical value of the resultant 
test. As a method of validation it is open to the statistical 
objection that the individual question is usually included 
in the total score with which it is compared; and, more 
important, to the logical objection that it implicitly assumes
high validity for the test as a whole—which is precisely the
point at issue.
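The statistical objection, that a question is counted in the very total against which it is ‘validated’, can be shown with a sketch. It assumes ten questions answered entirely at random, so that no question has any real association with any other. The numbers of subjects and questions are invented.

```python
import math
import random


def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(1)  # fixed seed so the run is repeatable
N_SUBJECTS, N_ITEMS = 2000, 10

# Every answer is an independent coin toss: 1 = right, 0 = wrong.
answers = [[rng.randint(0, 1) for _ in range(N_ITEMS)]
           for _ in range(N_SUBJECTS)]

item = [row[0] for row in answers]           # the question under scrutiny
total = [sum(row) for row in answers]        # total score, question included
rest = [t - i for t, i in zip(total, item)]  # total score, question excluded

r_item_total = pearson_r(item, total)
r_item_rest = pearson_r(item, rest)

print(f"question against total including itself: r = {r_item_total:.2f}")
print(f"question against total excluding itself: r = {r_item_rest:.2f}")
```

With ten equally variable and unrelated questions the first coefficient is expected to lie near the square root of one tenth, about 0.3, purely because the question forms part of the total; the second should lie near zero. Removing the question from the total meets the statistical objection only. The logical objection stands in either case.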

Establishing reasonably high internal consistency of this 
kind may or may not be a precondition for test validation, 
whatever the external criterion; but in no circumstances 
should it be regarded as a substitute for the process of 
validating. In fact, there is reason to suppose that the 
homogeneous, pure tests which achieve a very high degree 
of internal consistency may tally less well with general 
intelligence, as the man in the street understands it and the 
non-psychometric psychologist uses it, than do the more 
mixed intelligence tests with their lower internal consistency. 
In a test of intelligence, the more heterogeneous the tests— 
with respect to medium, bias, subject-matter and relative 
difficulty for different individuals—the more scope for 
discrepancies between questions and totals, and the more 
scope for subjects to behave as individuals, each excelling 
in the type of thing that appeals to him. The questions 
which show the least agreement may prove the most interest-
ing and instructive.1 Again the solution to these problems
can be attempted only in terms of the independent criterion 
chosen, or manufactured. 

By validating a test is meant ascertaining the extent to 
which results on the test tally with some other method of 
assessing that which the test is designed to assess. What 
criteria are there, then, against which to validate tests of 
intelligence? Chronological age; occupational or socio-
economic status; factor analytical results; personal judgment,
in the form of ratings or rankings or qualitative assessments,
based on long acquaintance; personal judgment based on
an interview, lasting perhaps an hour, perhaps less; results
of examinations, either scholastic or professional; other,
established psychological tests—both those which purport to
be exclusively cognitive and those with a more clinical, pro-
jective flavour. All these have been used, separately and in
varying combinations. Let us consider each of them in
turn, beginning with the last one.

1 For a statistical discussion of this point, see (a) Richardson, M. W.,
‘Notes on the Rationale of Item Analysis’, Psychometrika (1936), 1.
(b) Gulliksen, H., Theory of Mental Tests (1950), John Wiley and Sons.


(a) To validate a new intelligence test solely by comparing 
it with some already existing test with high ‘validity’ is 
clearly unsatisfactory. Most methods of validation turn out 
on inspection to be circular but this particular circle is 
offensively tiny. The argument is as follows: if test B correlates
highly with test A, then test B is a valid test of intelligence—since
test A has already had its validity established.
If we should ask why some other, non-test, criterion such as 
was originally used for test A, be not used for test B, we may 
variously be told that, since the data are at hand, this 
method is quicker and simpler or that other criteria are 
hard to come by—or that such other criteria as exist are 
themselves of doubtful validity and that test A is a purer and 
more reliable measure of intelligence. That the last con- 
sideration, if applicable to test B, must have been equally 
applicable to test A is not usually discussed. 

The circle, however, continues to turn round itself. 
Suppose A and B yield a correlation of 0.6, for example: the
psychometrist has two courses open to him. He may return 
to test B and alter it in the hope that it will, in its new form, 
correlate more highly with test A. Or he may claim that the 
correlation found proves the essential excellence of his new 
test: it shows a statistically significant agreement with test 
A—which has been widely accepted as a valid intelligence 
test—but the agreement is not so close as to suggest that the 
construction of test B has been a waste of time. He has 
profited by experience and produced a test which is just 
that much better. This second course is taken at least as 
often as the first, although it is not often expressed as baldly
as this. Mental testers in all fields are liable to treat the
fallibility of external criteria as an invitation to have the
best of both worlds.

Another objection to this method is the assumption
underlying it that test validity can have meaning outside a
particular context—a point discussed at the beginning of this
chapter. It is reminiscent of the occasional vocational
counsellor who is prepared to judge his interviewee as ‘good’
or ‘poor’ without stipulating ‘for what?’ Even if test A be
claimed as a valid test after, as well as before, its comparison
with test B, and the intercorrelation of the two tests be found
to approach unity, all that can legitimately be inferred is
that test A and test B demand almost identical abilities, or
that the ability required by A is almost always found in conjunction
with the ability required by B, or that the group
who took both tests was selected in such a way as to exaggerate
the range. It cannot be inferred that the tests both
possess near-perfect validity.
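
The point that close agreement between two tests says nothing about their agreement with an outside criterion can be illustrated with a small simulation. What follows is only a sketch on invented data: the shared "ability", the unrelated "criterion", the noise levels and the cutoff of 1.5 are all assumptions made for the illustration and are not drawn from any study cited in the text.

```python
import math
import random


def pearson(xs, ys):
    """Product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate(noise_sd, n=5000, seed=0):
    """Hypothetical data: tests A and B both tap one shared ability,
    while the criterion is generated independently of that ability."""
    rng = random.Random(seed)
    ability = [rng.gauss(0, 1) for _ in range(n)]
    criterion = [rng.gauss(0, 1) for _ in range(n)]
    test_a = [a + rng.gauss(0, noise_sd) for a in ability]
    test_b = [a + rng.gauss(0, noise_sd) for a in ability]
    return ability, test_a, test_b, criterion


def agreement_without_validity():
    """A and B agree closely, yet neither tallies with the criterion."""
    _, test_a, test_b, criterion = simulate(noise_sd=0.3)
    return pearson(test_a, test_b), pearson(test_a, criterion)


def range_effect(cutoff=1.5):
    """Compare the A-B correlation in the whole group with that in a
    group made up only of very high and very low subjects."""
    ability, test_a, test_b, _ = simulate(noise_sd=1.0)
    r_all = pearson(test_a, test_b)
    extremes = [i for i, s in enumerate(ability) if abs(s) > cutoff]
    r_extremes = pearson([test_a[i] for i in extremes],
                         [test_b[i] for i in extremes])
    return r_all, r_extremes


if __name__ == "__main__":
    r_ab, r_a_crit = agreement_without_validity()
    print(f"A with B: {r_ab:.2f}; A with criterion: {r_a_crit:.2f}")
    r_all, r_ext = range_effect()
    print(f"A with B, whole group: {r_all:.2f}; extremes only: {r_ext:.2f}")
```

In the first comparison the two tests intercorrelate highly although, by construction, neither is related to the criterion. In the second, the same pair of tests shows a markedly higher intercorrelation when the group consists only of extreme subjects, which is the exaggeration of range referred to above.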


I do not wish to suggest that the comparison of new tests
with older established intelligence tests is useless or even
that it can be dispensed with, in the process of accumulating
test data. The practice often yields instructive and unexpected
information. But I strongly deprecate its being used
alone as a satisfactory and sufficient method of validating a
new test of intelligence—or of anything else.
Projection tests, such as the Rorschach, and other clinical
individual tests which aim at […] other things are in rather
[…] as assessed by such tests is […] intelligence essays to
measure. […] express their test findings in vaguer, more
qualitative terms. This results in greater
flexibility and in greater emphasis on the intimate relation
between the orectic and the cognitive. The understanding
clinician may well have a fuller and more valid picture of
his subject’s general intelligence than has the specialist 
tester of intelligence. On the other hand, the clinician whose 
vagueness is a refuge and whose flexibility is merely in- 
decision thinly disguised, is likely to offer a less worthwhile 
estimate of his subject’s intellectual level than does the 
rigid tester of intelligence. The case for greater flexibility is 
made in Chapter XI. It is not immediately relevant to 
a discussion on methods of validation. 

(b) The academic examination has something in common 
with the established intelligence test, notably in system of 
allocating marks, which enables the standard methods of 
correlation to be used. There are several important differ- 
ences, however. The examination results are more obviously 
subjective: this has been demonstrated by experiments in 
which several examiners have marked the same set of papers 
or the same examiner has marked the same set on more than 
one occasion. If pass-fail be used as the criterion, rather than 
a list of examination marks, the classification is of course just 
as much a matter of subjective judgment—although the 
simple dichotomy conveys a greater impression of objectivity 
and, being far cruder, offers less scope for inconsistency 
between examiners. (The tendency to infer from such 
consistency that the judgments are necessarily valid is as 
prevalent as is this tendency with tests.) 

A second difference between tests and examinations is
that the latter vary, and are intended to vary, with the
subject-matter. Examiners do not set up an elaborate
statistical apparatus to explain, or to explain away, discrepancies
in score found between candidates’ results on a
geography examination and a Latin examination—or even
discrepancies found between first and second M.B. examinations,
taken by the same medical students with a time
interval of two years. They consider that those candidates
interested in geography and good at it will not necessarily
be those who are keen on Latin and excel in it; and they are
not surprised to find that some medical students have done 
more work than others in the two years, that some have 
developed an interest in second M.B. subjects which they 
strikingly lacked in the first M.B. subjects and that some 
have undergone psychological or physical experiences with 
apparent intellectual repercussions.

This attitude is at variance with that of the psychometrist,
who is apt to interpret lack of agreement between an indi- 
vidual performance test of intelligence given one year and 
two pencil and paper tests, with different biases, given some 
years later, in terms of sampling error, or low test consistency 
or poor testing technique. The underlying assumption is 
that if there are valid tests of general intelligence, then they 
must all intercorrelate closely.

This leads me to the third and most important difference
between examinations and intelligence tests. Examinations
(whether school, university or professional) do not purport
primarily to assess intelligence. In so far as it is permissible
to generalize, they aim to assess knowledge: the extent to
which information has been acquired and digested. The
successful assimilation of information no doubt does involve
intelligence, just as it involves diligence, memory and
attendance on the part of the candidate, and skill and understanding
on the part of the teacher. Most examiners would
claim (with varying justification) that intelligence is necessary
but not sufficient for success in the papers they set. The
stress laid on acquaintance with the relevant facts will vary
with the level, the subject-matter and the traditions of the
examination and the taste of the examiner.

The objections to validating intelligence by means of 
examinations will now be clear. Why take as criterion a 
measure which is not even intended to be primarily a 
measure of intelligence and which is, in addition, known to 
be a very variable measure? The reasoning is rarely made 
explicit but it often seems to follow this general course: whilst
not setting out to test intelligence exclusively, satisfactory 
examinations do aim to take intelligence into account; the 
candidate who lacks adequate knowledge but makes an 
intelligent show of what little he does know, is marked up by 
the perspicacious examiner; and, in any case, the sense of 
relevance—so essential in answering examination questions, 
when knowledge is adequate or even abundant—is closely 
related to intelligence. 

If this argument be accepted, a one-way relationship 
between intelligence tests and examinations might be 
expected: no candidates with very low test scores should do 
well, but some with high test scores may do badly, in their 
examinations—owing to laziness, for instance, or illness, or 
inaccurate remembering, or lack of interest. This has, in 
fact, been found in a fair number of the comparisons made 
between examination results and intelligence test scores. Of 
recent years, however, the comparison is sometimes made to 
justify the selection, or pre-selection, of examination candidates
by means of tests. That is, the validity of the tests is
assumed and any discrepancies found between them and the 
examinations tend to be accounted for in statistical, rather 
than psychological terms. 
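
The one-way relationship described above can be pictured with a toy model. The sketch below is an illustration only, resting on assumptions that are not in the text: test scores and examination marks share a 0 to 100 scale, and the mark is the test score scaled down by a hypothetical "application" factor standing for diligence, health, memory and interest.

```python
import random


def simulate_candidates(n=1000, seed=1):
    """Return (test_score, exam_mark) pairs on a 0-100 scale.

    Assumed toy model: the exam mark is the test score multiplied by an
    'application' factor between 0.2 and 1.0, so the mark can never
    exceed the test score.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        test_score = rng.uniform(0, 100)
        application = rng.uniform(0.2, 1.0)
        pairs.append((test_score, test_score * application))
    return pairs


def one_way_table(pairs, low=25, high=75, good_mark=50):
    """Count the two cells on which the argument turns:
    low test score with a good mark, and high test score with a poor mark."""
    low_test_good_exam = sum(1 for t, e in pairs if t < low and e >= good_mark)
    high_test_poor_exam = sum(1 for t, e in pairs if t >= high and e < good_mark)
    return low_test_good_exam, high_test_poor_exam


if __name__ == "__main__":
    low_good, high_poor = one_way_table(simulate_candidates())
    print("low test score, good examination mark:", low_good)
    print("high test score, poor examination mark:", high_poor)
```

Under this model the first cell is empty by construction and the second is not, which is the lopsided pattern the argument predicts. Whether real discrepancies arise in this way is a psychological question that the model does not settle.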

(c) Entirely statistical is the validation of mental tests by 
means of factor analysis. Little need be said here of this 
method since it does not fulfil the important condition of 
independence and it has been discussed in Chapter V. The 
method earns mention here mainly because it plays such a
large role in contemporary mental testing, at all stages. For
many psychometrists a test is both defined and validated
when, and only when, it has been administered in company
with a number of other tests and the test scores have undergone
factor analysis, in one or other of its forms. An
intelligence test is valid, then, in proportion to its first factor
saturation: the more highly saturated, the better a test of g.
Even were the results not contingent on the particular
battery of tests, the particular group or groups of subjects
and the particular technique used, it is clear that no amount
of manipulation and computation of the test scores can yield
information as to the correspondence between a set of such
scores and some independent characteristic human mode of
behaviour, such as the test was originally designed to assess.
The data gained would scarcely justify the necessary
redefining of the term ‘validation’.

(d) Occupational and socio-economic status come second
on my list and I propose to treat these together, despite their
not being quite the same thing. Occupational status presents
fewer difficulties of definition than socio-economic, with its
faintly snob connotation, and plain ‘occupation’ presents
fewer still. (These criteria have the unusual advantage of
being applicable without artificiality to adults.)

The phrase ‘occupational status’ implies immediately
some sort of hierarchy within jobs, perhaps even some flavour
of an intellectual hierarchy. It is important to avoid this at
the stage of validating tests, in order to avoid the insistent
circularity attendant on test-validation. Do we begin with
preconceived notions as to the mental superiority of the
dentist over the precision fitter, and of the bank clerk over
the shepherd? Or do we infer the mental status of the
occupational group from its mean on an intelligence test, in
order, at some later date, to use the test score of an individual
[…] towards a suitable vocation? The latter would
be a legitimate procedure if the data included the satisfaction
and the satisfactoriness of the original subjects in their
jobs. It might […]

[…] inquiries have been made in which the test scores
gained by members of different occupational groups have
been averaged and the means then arranged in order.1 But
very rarely has information been obtained on the efficiency 
and the well-being of the individuals concerned. It is true 
that such data have been included in follow-ups of voca- 
tional guidance investigations (mainly with children) but 
these have not strictly constituted validation of specific tests. 
The vocational advisers have, rightly, been concerned with 
the subjects as complete and complex individuals—with 
interests, ambitions, fears, not susceptible to straightforward 
[…] able to compensate here and to overcompensate
there.

The results of inquiries which consisted simply of comparing
mean test scores for adults in one type of work with
mean test scores for adults in other types of work have
agreed fairly well with one another and yielded three points
of interest. First, the final order of jobs has usually tallied
roughly with the order in which common sense would rank
them, if forced to do so. But, secondly, when the ranges of
scores within occupations were considered, a very big overlap
was found—not only between occupations of adjoining
ranks but also between those of widely separated ranks.
Thirdly, a tendency has been found for the range to increase
with the lowliness of the job: the scores of unskilled labourers,
for example, are liable to show far more scatter than the
scores of school teachers.

1 See: (a) Himmelweit, H. and Whitfield, J., ‘Mean Intelligence Test
Scores of a Random Group of Occupations’ (1944), Brit. J. industr. Med.;
(b) Cattell, R. B., ‘Occupational Norms of Intelligence and the
Standardization of an Adult Intelligence Test’ (1934), Brit. J. Psychol.,
xxv; (c) Yoakum, C. S., and Yerkes, R. M., Mental Tests in the American
Army.

These results are instructive but until additional data are
available on the proficiency of the tested labourers and school
teachers and others at work, we should limit ourselves to
meditating on the evident tolerance of employers or the
probable wastage of talent and labour or the qualitative
differences in jobs and in people which have become masked 
by concentrating on quantitative differences. Even given 
the additional data (which are extremely difficult to obtain) 
we should not possess a foolproof method of validating 
intelligence tests for adults, since so many other character- 
istics and modes of reaction have so many important and 
differential roles for so many different jobs. We cannot even 
assert, in the case of high grade occupations that intelligence 
is ‘necessary but not sufficient’, for one job may be high 
grade in virtue of the heavy responsibility it carries (such as, 
for instance, heat controller in a kiln), a good deal of intelli- 
gence being needed to appreciate the full consequences of a 
lapse; another may require problem solving at a high level 
just once every week or two, but little thought in between;
and a third may demand a consistent, but not very high
level of judgment, day after day. An attempt to iron out
such qualitative differences and to present the various
occupations on one scale—corresponding with a scale of
intelligence test scores—oversimplifies to such an extent that
it is bound to mislead. It can certainly not be regarded as
convincing validation.


In our search for criteria […] judgments of one kind or
another, and (f) chronological age or, rather,
what-the-child-can-do-at-what-age. If straight […]erally
as the criterion, this […] assessing a child’s intelligence
[…]ing his age. Certainly his age […] performance is to be
compared with […]erein lies the stress which Binet […]
but to know the child’s age and […]uld be of no more
value than […] height or weight.

[…] chronological age as a criterion do in fact
suggest that height or weight may be equally well (or ill)
chosen for purposes of test validation. In so far as height and
weight tend to increase contemporaneously with the child’s 
mental development (and age), they are no doubt possible 
criteria. This should not be confused with the results of cross- 
sectional studies of malnourished children drawn from similar
age groups, in which a readily understandable connection 
has been found between stunted growth and poor intelli- 
gence test score. 

However, from the viewpoint of criterion, there are several
important differences between such physical measures on 
the one hand and chronological age on the other. Height 
and weight vary both absolutely and in rate of increase from 
birth through childhood and adolescence; and individuals 
vary greatly in their height and weight when they have 
stopped growing. Unlike chronological age which, by 
definition, shows no comparable variation, these physical 
variations tally little with the many other criteria of intelli- 
gent behaviour which combine to yield the ‘spiral approxi- 
mation’ quoted at the end of this chapter. 

One further criticism of chronological age as the basis for 
test validation is that intelligence is known not to increase 
with age in certain special cases, such as mental defectives 
and some types of psychiatric patients. This is true and 
would be relevant if in the individual case chronological age 
tout court were the criterion: if one said ‘I haven’t time to 
test this child so I’ll just find out how old he is instead.’ But
for the large-scale standardization with which this chapter
is concerned, these cases are too infrequent to invalidate the
[…]. In fact, the Binet scale which is based on the concept
of chronological age is one of the most effective means
of diagnosing the child who is either subnormal or abnormal.

Chronological age thus seems to me the most satisfactory
criterion for test validation which is available for
children. It fulfils most of the necessary conditions of a good
criterion of intelligence for subjects between the ages of, say,
five and fifteen. Under five, it is too difficult for the tester,
who is not well acquainted with the child, to distinguish 
between ‘will not’ and ‘cannot’. Over fifteen, it cannot be
assumed that mental capacity increases with increasing age
(although, as I have suggested, it is unlikely that mental
development comes to an abrupt halt at about that age). 
Chronological age is objective; it is reliable in every sense 
and is independent of the test or test battery; and there is 
no doubt that intelligence, however defined, does with rare 
exceptions increase as age increases, at least within the limits 
postulated. 

A further point in its favour is its demonstration of the fact 
that different people find different tasks difficult. The Binet
scale, as is well known, has been completely standardized
with large numbers of representative children of each age
group: the average child of ten ‘should’ be able to solve all
the nine-year-old problems, more than half of the ten-year-old
ones and few or none of the eleven-year ones. Yet it is
not uncommon to find, in the course of testing, one child
[…] year tests yet passes some at the
eleven-year level, for no obvious […]

Chronological age is, then, an invaluable criterion. The
six-year-old is, unless mentally ill, more intelligent than he
was at five and is less […]

[…] first, that rates of mental development may well be differential,
as between individuals and as between different
periods in the life of one individual—as is the case in physical
development. Secondly, that if test A has been validated,
laboriously and successfully, against chronological age and
test B is then validated, less laboriously, against test A, some
cumulative error is likely to be present by the time test C is 
validated against test B. 
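
The way error accumulates along such a chain can be sketched numerically. The example below rests on assumptions that are not in the text: scores are standardised, each new test agrees with its predecessor to the extent of a correlation of about 0.8, and it adds nothing but fresh random error. The criterion list is a stand-in for what the child can do at a given age, not real data.

```python
import math
import random


def pearson(xs, ys):
    """Product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def next_link(previous, r, rng):
    """Scores correlating about r with `previous` (assumed standardised),
    the remainder being independent random error."""
    spread = math.sqrt(1 - r * r)
    return [r * p + spread * rng.gauss(0, 1) for p in previous]


def chain(r=0.8, n=20000, seed=2):
    """Validate A against the criterion, B against A, C against B, and
    report how each test tallies with the original criterion."""
    rng = random.Random(seed)
    criterion = [rng.gauss(0, 1) for _ in range(n)]
    test_a = next_link(criterion, r, rng)
    test_b = next_link(test_a, r, rng)
    test_c = next_link(test_b, r, rng)
    return [pearson(criterion, t) for t in (test_a, test_b, test_c)]


if __name__ == "__main__":
    for name, r in zip("ABC", chain()):
        print(f"test {name} with the original criterion: {r:.2f}")
```

Under these assumptions the agreement with the original criterion falls off roughly as the product of the successive correlations (about 0.8, 0.64 and 0.51), although each test looks respectably "validated" against the one before it.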

We still have no satisfactory criterion for adults and we 
are left with only one from our original list: personal judg- 
ment (e). As already indicated, this enters at some stage and 
to some extent, into all the others with the exception of 
chronological age. Here, however, we are concerned with 
straightforward subjective judgment, undisguised as an
objective ‘pass-fail’ in an examination or as a statistically
derived ‘factor’; we have, in fact, an admittedly psychological
means of assessing something psychological with its inevitable
variability, vagueness and bias. Many and varied are
the methods by which psychologists and others have
endeavoured to overcome the difficulties of obtaining
meaningful judgments on one individual by another,1 but
the inherent difficulties still remain whether the form of
judgment adopted be a five-point rating scale or a qualitative
assessment, a series of rankings or of paired comparisons.

1 See, for instance, Heim, A. W., ‘Industrial Assessments: some Problems
and Suggestions’ (1946), Occup. Psychol., xx, 1.

The personal judgment criterion may be used for adults
and children; the judgments may be made by teachers or
relatives, by employers or employees, by fellow subjects or
by the subject himself. None of these can be designated as
impartial but it would be hard to find a completely impartial
judge who knew enough about the individual concerned to
be in a position to estimate his intelligence convincingly. Certainly
validation studies have been performed on intelligence
(and other characteristics) with judgments based on a single
interview. Here the judges, if biassed, are so only on the
basis of a short acquaintance. This may however be just as 
violent as a bias grounded on longer acquaintance and will 
very likely be still more unfounded. Available evidence 
suggests that assessments of intelligence based entirely on 
interview are among the most unreliable of estimates. This 
is probably due to the interaction of personality with 
personality: many characteristics of the interviewee and the 
interviewer are liable to vary concomitantly, and so 
dependent is this phenomenon on the particular rapport 
achieved in any one interview that even this variation is 
unlikely to be uniform. 

All personal judgments made by an individual on another 
are liable to partiality, to prejudice, to over-compensation 
and to unrecognized ignorance. The meaning the judges 
attach to the term ‘intelligence’ will vary, as will the back- 
ground against which they think fit to consider the subject. 
On the other hand to attempt to assess someone on intelli- 
gence, or on anything else, without having some background 
or population more or less explicitly in mind would render 
a hard task well-nigh impossible. The practice favoured in 
some quarters, of giving the judges systematic training before 
they make their assessment may result merely in obscuring 
important differences of opinion. It certainly increases the
agreement between assessors but to infer from this increased
soundness of judgment is an unwarranted (though sometimes
taken) step.


For personal judgment of the subject’s intelligence to be a 
satisfactory criterion, then, the judge must be self-consistent, 
impartial, clear-headed and well-acquainted with the sub- 
ject; he requires also a keen sense of proportion; it is an asset 
if he is perspicacious and, himself, reasonably intelligent. 
Whilst admitting that this combination is found compara- 
tively rarely, I should nevertheless claim a certain value for 
personal judgment as a criterion. There are some people
whose judgment of others we trust; if that judgment conflicts
with other criteria, we tend to question the other criteria— 
and with reason, as has been shown. What is meant by 
intelligent behaviour, and the degree to which any one 
individual displays it, are in the final analysis matters of 
personal judgment. We find it at all points on the circle 
termed ‘test validation’. 


In view of the lack of a generally acceptable criterion, what 
grounds are there for designating certain intelligence tests 
as well validated?—or, less controversially, as better vali- 
dated than others? What is meant by the statement that 
intelligence tests are the most valid of the many types of 
mental tests? The answer is unsatisfactory because unprecise: 
it consists of a vast amorphous collection of data, accumu- 
lated along all the lines I have discussed. The general drift, 
whatever the criterion adopted in any one instance, has 
been in roughly the same direction. Whether intelligence 
test scores are compared with occupations or with problems 
solved by children at different ages; with results of examinations
or of other intelligence tests; with pen pictures or with
ratings on set scales; the results tend to show association in
the predicted direction. As Bridgman puts it: ‘... the actual
situation here is one of spiral approximation, as it so often
is’.1
1 Bridgman, P. W., ‘Some General Principles of Operational Analysis’
(1945), Psychol. Rev.

[…] of standardized tests, despite their variation,
yield information in an hour or less which can otherwise only
be obtained in some much more time-consuming way.
Conflict between intelligence test score and independent
assessment certainly occurs, but it is the exception rather than
the rule. Furthermore, in many cases of such conflict it is
either resolved by the consideration of relevant data hitherto
ignored (such as a partially deaf, high-scoring child,
assessed as stupid by a member of the school staff who was 
unaware of the child’s disability) ; or the test score ultimately 
proves itself to be the more satisfactory long-term criterion 
(as has been demonstrated in accounts of child guidance and 
vocational guidance work, and of selection and allocation 
in industry and in the Services).1

There is a good deal of evidence that, given fairly 
unselected groups, intelligence—as estimated by standard 
intelligence tests—is the most important single factor in 
determining scholastic and occupational adjustment. The 
criteria for validating such tests, despite their shortcomings, 
are more adequate than those used for validating tests of 
a more emotive type, and the agreement in general is more 
convincing.


1 See, for instance: Vernon, P. E., and Parry, J. B., Personnel Selection in
the British Forces (1949), University of London Press, London; and
Rodger, A., ‘A Borstal Experiment in Vocational Guidance’ (1937),
Rep. industr. Hlth. Bd., Lond., No. 78, H.M. Stationery Office.
Chapter Nine


SPEED AND POWER 


A clear-cut distinction between ‘speed’ and ‘power’ in 
mental testing is frequently drawn by psychometrists.1 They
sometimes go so far as to designate a test quite simply as a 
‘speed test’ and another equally simply, as a ‘power test’. 
They rarely go so far as to define the two terms and in this 
they may be wise, for no definition will withstand examina- 
tion either of the underlying psychological assumptions or of 
the available experimental data. Consideration of both 
these suggests that the dichotomy is one of convenience— 
whose maintenance may at times prove very inconvenient; 
and that in those unusual circumstances which do permit of 
its maintenance, ‘speed’ and ‘power’ act and react upon one 
another in an intimate and not wholly predictable way. 
A typical discussion might run somewhat as follows: 


Psychometrist: It is quite clear. A speed test is one in 
which the quickest do best. 

Psychologist: Regardless? 

Met: How do you mean, ‘regardless’? 

Log: The quickest, regardless of whether they do the 
tasks correctly or not? 

Met: No, of course not. The quickest of those who get 
them right. You can either count the number correct in a 
set time or record the time required to—


Log: But if they have to get them right, it’s a power test, 
surely?


1 See, for instance: (a) Slater, P., ‘Speed of Work in Intelligence
Tests’ (1938), Brit. J. Psychol., xxix, 1; (b) Freeman, F. S., ‘Power and
Speed: their Influence upon Intelligence Test Scores’ (1928), J. appl.
Psychol., xii, 1; (c) Ruch, G. M. and Koerth, W., ‘“Power” vs “Speed”
in Army Alpha’ (1923), J. educ. Psychol., xiv, 4.

Met: No, because what you’re interested in is the speed
with which they get them right. If it were a power test, it
would be untimed or would have such a generous time limit 
that practically everybody would reach the end. 

Log: Well, it’s partly a power test. If it were just a test of 
speed you would only be interested in the number attempted 
—or the number correct would be the same as the number 
attempted. 

Met: In a power test, we're not interested at all in speed. 
What matters in a power test is accuracy. 

Log: As opposed to carelessness, you mean? 

Met: Well, no, as opposed to stupidity. The careless 
subject may be highly intelligent. 

Log: Which do you mean by ‘intelligent’—speedy or 
powerful? And, anyway, if you’re assessing intelligence by 
means of these tests, how do you know? 

Met: Intelligence is best measured by a reliable test of 
power. In such tests, as you will know, we make allowance 
for differences of speed, by grading the questions from easy 
at the beginning to difficult at the end of the test. 

Log: Are you suggesting that an objective unalterable 
gradient of difficulty can be achieved in an intelligence test? 

Met: Yes, of course. We put the easiest questions first and 
gradually lead up to the harder ones so that the quick, slick 
subjects will be penalized and the slow but sure subject will 
be more likely to do himself justice despite the time limit. He 
will have time to tackle only the easier questions—and will 
generally get them all, or nearly all, right. The subject who 
reads and thinks quickly and superficially, will find his
speed gradually decreasing as he works through the test or 
will tend to increase his proportion of errors. The subjects 
who work quickly and accurately will gain the highest 
scores—which is in line with the normal concept of intelligent
behaviour—and the subjects with the worst results will
be those who are slow, and either stupid or careless.

Log: That sounds very nice. I would agree with the last
point. But I don’t think the gradient of difficulty of a test 
can be established as independently as you suggest—unless 
it’s an exceedingly steep gradient. In fact, it has been found 
that a difficulty gradient successfully determined on one 
group may turn out to be a very different shape on another 
group of subjects—either because of differences between the 
groups, or because of the practice effect produced within the 
test itself. 

Met: That might happen if the groups weren’t big enough. 
Anyway that trouble could be overcome by randomizing 
the questions in the first place—giving them in different 
orders to different subjects. 

Log: But it’s not just a statistical matter. For instance, the
instructions with which the test is given may make a great deal 
of difference. Members of an intelligent but unsophisticated 
group taking the average sort of intelligence test will often 
work rather slowly simply because they are intelligent and 
critical: they like to check their answers or they feel there 
must be a catch somewhere. They will do themselves justice 
on such a test only if instructed to work as fast as possible. 
On the other hand, if a group of dull subjects were given this
instruction, they would probably fail to do themselves
justice. If you had a genuine cross-section of the population,
the speed element would presumably affect different
individuals differentially.

Met: It doesn’t much matter in practice as it is well
known that speed and power tend to go together. On the
whole, brighter subjects work faster than duller ones—and
the quicker subjects tend also to be the brighter ones.

Log: Doesn’t that depend on the particular relation

* See, for instance: (a) [...], A Statistical Study of Certain
Aspects of the Time Factor in Intelligence (1927), Teachers Coll.
Contrib. Educ., No. 248; (b) Peak, H. and Boring, E. G., ‘The Factor of
Speed in Intelligence’ (1926), J. exp. Psychol., IX, 2; (c) Ruch and Koerth,
op. cit.
between your group and your test? If, for example, the test
questions are too easy to extend the subjects fully, then they 
will all tend to get the vast majority of their attempted 
questions right. If the test is short for the group concerned, 
as well as easy, you will just fail to get much discrimination 
of any kind; if it is long, then of course you will get close 
agreement between ‘speed’ and ‘accuracy’ because the 
‘accuracy score’ (number correct) will be almost identical 
with the ‘speed score’ (number attempted). It would be a 
very different story with a different relation between test 
and group. 

Met: You’re talking about an ‘accuracy score’ and a 
‘speed score’ now, though you objected to my distinguishing 
between power and speed tests. The distinction is quite 
clear once you establish the level of difficulty of test problems 
and the rate with which such problems are solved. One 
might almost assess the ‘power’ of a subject by seeing at 
what level of difficulty he begins to fail. 

Log: But (a) I can’t agree with this purely quantitative 
concept of ‘difficulty’. Your suggestion implies that what is 
harder for one subject is necessarily harder for the next sub- 
ject and this is often not the case. (b) Even ignoring these 
qualitative and individual differences, ‘difficulty’ is a very 
tricky notion to work with. 

Met: No, I don’t think so. You can estimate the difficulty 
of any question in your test by calculating the proportion of 
subjects who, given unlimited time, got it right. The problem 
which was correctly solved by 66 per cent of the group, for 
instance, is more difficult than the one which was correctly
solved by 72 per cent of the group. 

Log: Again, I have two objections. In the first place you
say ‘given unlimited time’, but the members of the group
obviously never have an infinitely long time in which to
tackle the test questions. If they solve the problem in a
relatively short time, the subjects are presumably not fully
extended and you will be unable to deduce from their
performance what you want to know. Perhaps they all
would have solved the problems, given unlimited time—in 
which case we should have to reintroduce a time element 
in order to effect discrimination and in order to estimate the 
degree of difficulty of the respective questions. Might one 
not equate ‘cannot solve’ with ‘needs an infinitely long time 
to solve’? 

Secondly, the significance you attach to ‘difficulty’ is 
surely very arbitrary. You seem prepared not only to 
dispense with any criterion external to the test and the group 
but also to ignore the time taken by the ‘powerful’ subjects 
to solve the problem. If, for argument’s sake, 60 per cent of
the subjects give the right answer to questions 12 and 15, but
you happen to know that the mean time required by these
subjects for number 12 was ten seconds and for number 15
(similar in form) forty seconds, would you not say that 
question 15 was more difficult? 
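
The two criteria at issue in this exchange can be set side by side in a short sketch. It is an illustration only: the function names, the data layout and the figures are assumptions made for the example (chosen to echo the 60 per cent, ten-second and forty-second case above) and are not data from any experiment.

```python
from statistics import mean


def proportion_correct(responses):
    """Met's criterion: share of subjects answering correctly (lower = harder)."""
    return sum(1 for correct, _ in responses if correct) / len(responses)


def mean_time_of_solvers(responses):
    """Log's criterion: mean seconds taken by those who answered correctly."""
    return mean(seconds for correct, seconds in responses if correct)


# Hypothetical records: (answered correctly?, seconds taken) for ten subjects.
question_12 = [(True, 10)] * 6 + [(False, 25)] * 4
question_15 = [(True, 40)] * 6 + [(False, 25)] * 4

for name, item in (("12", question_12), ("15", question_15)):
    print(name, proportion_correct(item), mean_time_of_solvers(item))
```

On these invented figures the two questions are indistinguishable by proportion correct and plainly different by time taken, which is the whole of Log's second objection.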


Discussions of this kind, if not liable to continue indefinitely,
tend to recrudesce, equally unfruitfully, whenever Psychologist
meets Psychometrist. I believe, however, that a
genuine and interesting set of problems does exist and that
they are susceptible of experimental investigation, although
they have not as yet been much subjected to it.

In my view, the concept of difficulty is complex and it
constitutes the crux of the problem of defining and relating
speed and power. We need to examine the criterion of
Met (proportion correct) and of Log (time taken) and,
above all, should not keep the results of these examinations in
separate compartments. Do problems which are difficult
in Met’s sense take a longer time to answer than those
which are easy? Or a shorter time? Or is there no close
relationship? Do subjects tend to be consistently slow and
consistently fast or do they vary between questions? Are some
much more variable than others? Is the variability related to 
the difficulty (Met’s criterion) of the questions? Is an 
‘internal’ criterion of difficulty acceptable? Are we justified in 
defining it in terms of performance on tasks comprising a 
test which has been designed specifically to assess the ability 
to cope with difficult tasks?—especially since, in practice, this 
necessitates our limiting ourselves to simple, and similar, 
tasks? Is it legitimate to argue from results with untimed 
tests to timed tests? May not the imposition of a time limit 
affect the relative difficulty of the questions in a test? Do 
questions with a longer (or shorter) mean-time-taken tend 
to be answered correctly? Is there a significant difference 
in mean-time-taken between correct and incorrect answerers, 
for given questions? And, if so, in which direction does it
go? 

By timing individual subjects on individual questions, 
without the subjects being aware of this, it would be possible 
to gain information on these and allied problems. I know of 
few inquiries in which satisfactory data of this kind have 
been obtained and analysed.¹ In the particular experiment
I propose to discuss,² the series of tests used consisted entirely
of spatial perceptual tasks. There is, however, no reason to
suppose that similar results would not apply to tests of
intelligence; and the complexity of the results suggests that 
further work along the same lines might throw new light on 
some of the psychological problems of mental testing, which 
have for so long been shelved.

In this experiment, three of the six types of problems were
of the multiple choice variety and three were of the ‘creative
answer’ variety. The subjects were instructed to work at

¹ Sutherland, J. D., ‘The Speed Factor in Intelligent Reactions’ (1934),
Brit. J. Psychol., XXIV, 3. This interesting paper is one of the few.
(Eighteen references on ‘speed and power’ are given.)

² Cane, V. R. and Horn, V., ‘The Timing of Responses to Spatial
Perception Questions’ (1951), Quart. J. exp. Psychol., III, 3.
their own pace. No straightforward relationship was found
between the mean time taken to answer a question and the 
correctness of the answer. But when the variety of question 
was taken into account (that is, whether multiple choice or 
creative answer), it was found that correctly solved creative 
answer problems took significantly less time than incorrectly 
solved creative answer problems. This was not true of the 
multiple choice problems which showed in fact the opposite 
tendency. These findings were complicated further by
differences between ‘good’ and ‘poor’ subjects. For instance, 
in this test series which was of a very high standard, the 
better (more highly scoring) subjects tended on the whole to 
work more slowly than the poorer subjects. The discrepancy 
between this result and that of most other investigators in 
allied fields suggests, once again, that sweeping generalizations
on these topics are dangerous: that conclusions should
be drawn only in terms of the type of test (timed or untimed,
multiple choice or creative answer, etc.) and of the relation
between the test and the group.

One more example may be given to illustrate the
difficulty of drawing a clear-cut distinction between
speed and power tests. The data come from some experiments
on the effects of repeated retesting on intelligence
test performance,¹ but here we are concerned with practice
effects only in so far as they affect the speed-power question.
In these experiments, four groups of different intellectual
level took weekly one of two intelligence tests over a period
of nine weeks. One of the tests, AH 4, was originally devised
for unselected groups and the other, AH 5, was devised for
selected highly intelligent subjects. Progress curves for the
four groups were drawn, showing (a) the mean number of
questions attempted, (b) the mean number of answers
correct and (c) the ratio of answers correct to questions

¹ Cane, V. R. and Heim, A. W., ‘The Effects of Repeated Retesting’
(1950), Quart. J. exp. Psychol., II, 3.
attempted, for each of the nine testings. A comparison of the
four sets of progress curves revealed that the same test, given 
with the same instructions, time limit, etc., could be for one 
group primarily a test of speed—in the sense that those 
attempting most questions automatically gain the highest 
scores—and for another group, primarily a test of power— 
in the sense that some of those attempting many questions 
obtain lower scores than some of those attempting few ques- 
tions. This was shown most clearly in the two sets of AH 4 
progress curves: the brighter group consistently maintained a 
very high ratio of correct to attempted (between 80 per cent 
and 90 per cent), despite a steep rise in number attempted
over the first five weeks. For this group, the progress curve 
showing number of answers correct was in fact almost 
identical with that showing number of questions attempted. 
The picture for the duller group on this same test was very 
different: their ratio of correct to attempted remained consis- 
tently as low as 50 per cent all the time, although their 
speed (number attempted) increased continuously through- 
out the experiment. 
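
The three progress curves described above can be computed in a few lines. The sketch below is hypothetical throughout: the figures are invented to resemble the pattern reported (a ratio held between 80 and 90 per cent for one group and near 50 per cent for the other), the function name is an assumption, and the ratio is taken as mean correct over mean attempted, which is only one of the ways it might have been calculated.

```python
def progress_curves(testings):
    """Return (mean attempted, mean correct, ratio correct/attempted) per testing.

    `testings` is a list of testings; each testing is a list of
    (attempted, correct) pairs, one pair per subject.
    """
    curves = []
    for testing in testings:
        n = len(testing)
        mean_attempted = sum(a for a, _ in testing) / n
        mean_correct = sum(c for _, c in testing) / n
        curves.append((mean_attempted, mean_correct, mean_correct / mean_attempted))
    return curves


# Hypothetical figures for two groups over three weekly testings.
brighter = [[(30, 26), (34, 29)], [(40, 35), (44, 38)], [(50, 44), (52, 45)]]
duller = [[(20, 10), (24, 12)], [(30, 15), (34, 17)], [(40, 20), (44, 22)]]

for label, group in (("brighter", brighter), ("duller", duller)):
    for week, (attempted, correct, ratio) in enumerate(progress_curves(group), start=1):
        print(f"{label} week {week}: attempted={attempted:.1f} "
              f"correct={correct:.1f} ratio={ratio:.2f}")
```

Where the ratio stays high while the number attempted climbs, the number-correct curve simply follows the number-attempted curve and the test is behaving as a test of speed for that group; where the ratio stays low, attempting more questions does not of itself secure a higher score.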


The same experiment demonstrated how the position of a 
testing in a given series may radically alter the speed-power 
emphasis of the test. For example, test AH 5 proved for its 
highly intelligent group to be more of a test of power (or 
accuracy) at the beginning of the testings than it did towards 
the end. This was not true of the same test taken, in the
same conditions, by less intelligent subjects. They evidently 
did not appreciate the difficulty of the test questions and 
concentrated on increasing their speed, at the expense of 
accuracy. Only when the majority of these subjects were 
attempting the majority of the questions did their (initially 
very low) ratio of correct to attempted begin to rise. Thus, 
many questions answered incorrectly by a subject in an early 
trial were answered correctly by the same subject, at a later 
trial, again effectively disposing of the notion that individual 
test questions can have some inherent quality—be it of
speed or of power—regardless of context. 

Finally, there is some evidence from these and other 
investigations that people tend unwittingly to adapt (up- 
wards or downwards) to the level of difficulty of the tasks 
they meet—a tendency not confined to laboratory situations. 
If further work confirmed this hypothesis, it would reaffirm 
the oversimplification of the speed-power distinction. 


Having attempted to undermine the existing scaffolding, 
I should perhaps make some constructive suggestions. At 
present the ground can only be prepared for the first 
foundations: the structure will consist of data from experi- 
ments not yet even designed. Some discussion of terms, 
however, may be useful, if only to facilitate the planning of 
future work.

We are concerned with the concepts of ‘difficulty’, of
‘power’ and of ‘speed’. I hope to have established the
speciousness of treating any of these as self-contained
absolute qualities attributable to mental tests or to the
subjects taking these tests. But the terms are not without
value, provided that their meaning and limitations of
meaning be outlined, and the extent of their interaction
emphasized. The concept ‘difficulty of a problem’ is less
controversial than the concept ‘difficulty of a test’. The
former may be estimated in terms of time taken (by a given
group or individual) to solve the problem; or of proportion
(of a particular group) solving it (in given conditions); or of
level of intelligence (assessed independently) of the sub-group
solving it.

The ‘power’ of an individual may be estimated by means
of a test, given without a time limit (provided the test
sufficiently extends the subject); or of level of difficulty of the
hardest problem (of a given kind) that he can solve; or by
the speed with which he solves difficult problems (of known
standard). 

Speed may be estimated in terms of number of questions 
attempted, or of number of answers correct, in a given time; 
or in terms of the time taken to study, or to answer correctly, 
a given number of questions, of known difficulty; or in terms 
of the ratio of number attempted to number correct achieved
in a given time. (‘Attempted’ may be defined either as
having been answered, rightly or wrongly, or as having been 
considered by the subject.) 

It will be seen that none of the three terms has a precise 
meaning; that the choice of meaning for any one of them is
arbitrary since the alternatives have little in common; and 
that they are inextricably interwoven, one with another. All 
the definitions of speed are open to the objection that they 
assume subjects to be slow, and test questions time-consuming 
always for the same reason. There is little doubt that incen- 
tives, both internal and external to the test situation, play a 
role in determining the subject’s rate of work as well as his 
accuracy; and that time-demanding questions may be so for 
one or more of a number of reasons—for instance, complexity 
or unfamiliarity of the thought process, number of items in 
the question requiring mental manipulation, number of 
offered solutions requiring rejection, etc. Until the ambi- 
guities are admitted and the general topic recognized as 
worthy of planned experimentation, the vague yet sweeping
generalizations which have flourished for so long will
continue to do so.




Chapter Ten 


INTELLIGENCE AND ENVIRONMENT 


‘If a genetic analysis by correlation study were applied 
uncritically to such a thing as language, it is obvious that 
we should be able to show that native language was 
apparently a hereditary characteristic.’ McV. Hunt’s
Personality and the Behaviour Disorders (1944). Chapter by
L. S. Penrose, entitled ‘Heredity’, page 517.


Until recently, most books on one or other of the biological 
sciences contained some such chapter heading as ‘Heredity
and Environment’ or ‘Nature and Nurture’. This is true
even now of the majority of books written on psychology, or
specifically on intelligence. Such a dichotomy immediately
suggests that the two are separate and quantifiable, and that
any interaction between them is likely to be uniform; and
the contents of such chapters are apt to confirm this
impression. It is largely on these grounds that I have chosen
the title ‘Intelligence and Environment’—for a chapter
which might with as much and as little appropriateness have 
been called ‘Intelligence and Heredity’. 

It will be concerned mainly with discussion of the conflicting
conclusions reached by psychologists in this field and
with suggestions for a less ambiguous terminology.¹ It will
include also some consideration of the effects of cultural
background, of socio-economic status and all that goes with
it, of education and of test-familiarity, on intelligent
behaviour and on intelligence test performance.


¹ The established facts are admirably stated and discussed in L. S.
Penrose’s The Biology of Mental Defect (1949), and in J. B. S. Haldane’s
‘The Interaction of Nature and Nurture’ (1946), Annals of Eugenics, vol. XIII.
See also P. E. Vernon, ‘Psychological Studies of the Mental Quality of
the Population’ (1950), Brit. J. Psychol., XX, 1.
My first aim is to expose the absurdity not merely of such
statements as ‘the total contribution of innate and heritable
factors is probably not far from 75 or 80 per cent’¹ but of
the question ‘how far is x due to heredity and how far to
environment?’ The confusion is perpetuated owing partly
to political and social prejudices, and partly to inaccuracy
in the use of certain key words. For example, ‘innate’
is often used synonymously with ‘inherited’ but many
innate individual differences are due, and known to be 
due, to intra-uterine influence. It would obviously be a 
misuse of language to refer to these as ‘hereditary’; on 
the other hand the antithetical terms, ‘environmental’ and 
‘acquired’, are conventionally applied to post-natal occur- 
rences only. 

This then illustrates a confusion due to imprecise terminology.
My second example owes perhaps more to prejudice:
it is well known that the children of comfortably- and better- 
off parents tend to score more highly on intelligence tests 
than do the children from needy homes.² This fact has been
eagerly accepted by ‘progressive’ people as an argument for 
the supremacy of environment: given better food, warmer 
clothes, more books at home, etc., the poorer children 
would clearly match, if not outstrip, their more fortunate 
fellows, on intelligence test performance. The same fact 
has been acclaimed by the more conservative-minded as a
proof of the supremacy of heredity and of the rightness of 
the existing order of things. Wealth and ability naturally 
tend to go together in adults and naturally both get handed 
down to their children. The ‘haves’ inherit from the ‘haves’
whether it be a question of library or ‘character’ or intelligence.
For these people, equality of opportunity consists

¹ Burks, B. S., ‘The Relative Influence of Nature and Nurture upon
Mental Development’ (1928), Twenty-seventh Yearbook Nat. Soc. Stud.
Educ., Part I.

² Burt, C., ‘Ability and Income’ (1943), Brit. J. educ. Psychol., XIII, 2.

essentially in affording individual children the same
opportunities as those which their parents enjoyed. 

Taken to its logical conclusion, the dichotomy leads to 
such absurdities as the following: Tommy is now sixteen and 
taller than both his parents, therefore height cannot be an 
inherited (or innate or congenital or hereditary) trait; it 
must be acquired. Or the argument may run, with equal 
cogency: tall, sixteen-year-old Tommy has always lived at 
home with his short parents. His height cannot therefore 
be attributed to environmental factors, it must be an inheri- 
ted trait. It is not difficult to find assertions about complex, 
intangible qualities, scarcely less absurd than these, stated 
in all seriousness. 

The current controversies on the alleged decline of the 
population’s intelligence and the association between family
size and intelligence level owe their inconclusiveness, in my
opinion, largely to these two factors—of personal bias and
verbal imprecision. The question in any one instance, ‘how
much is due to heredity and how much to environment?’, is
an impossible one to answer, not because of lack of informa- 
tion but because, in that form, it is meaningless. 

With regard to intelligence, many psychologists treat the 
problem as primarily one of definition. They maintain that 
intelligence tests assess intelligence and that that which 
intelligence tests assess is innate, that is, it matures in 
accordance with a pattern determined at birth. If what is 
assessed is shown to depend on education, or to be suscep- 
tible to training, the test is thereby a less good measure of 
intelligence. The better the test, the less related are its
results to the subject’s environment—by definition. This
certainly simplifies matters at one level but it complicates
issues a good deal at other levels. Apart from the logical
objections to such a definition, it proves indefensible on
empirical grounds: its sponsors would be forced to conclude
that there is no such thing as intelligence, or a test of
intelligence, since all the relevant inquiries have indicated a
practice effect in mental tests comparable with that found 
in many other human skills.

It has been argued that since intelligence is an innate
quality, and a negative correlation has been found between 
family size and intelligence test score (based on siblings, not 
on offspring), the intelligence of the population generally 
must be declining. This prediction was in fact not confirmed
by the results of the second Survey of Scottish Schoolchildren’s
Intelligence,¹ which showed, on the contrary, a
slight overall rise in intelligence as assessed by standard tests. 
The results are instructive from several points of view: they 
hold especial interest for us at the moment, in the light of our 
criticisms of the innate/acquired dichotomy. 

The division which best allows the facts to be classified, 
without distortion or neglect, is that of Penrose.² He distinguishes
between genetical on the one hand and environmental
on the other, further subdividing them as follows:


Genetical:      Remote
                Recent

Environmental:  Early prenatal
                Late prenatal
                Intra-natal
                Post-natal

Thus the main difference between ‘nature’ and ‘nurture’
is in terms of time. The possession or lack of a trait, the degree


¹ Thomson, G., The Trend of Scottish Intelligence (1949).
² Penrose, L. S., The Biology of Mental Defect (1949).
to which it is eventually displayed in the individual’s
behaviour or appearance, and its complex relations with his 
other traits, may be determined at an earlier or later stage. 
In fact ‘determined’ implies too rigid and irrevocable a 
process: it is the setting of limits rather than the positive
establishing of a characteristic that takes place at a given
stage of development. This setting is essentially of upper
limits: an exceptionally unfavourable concatenation of 
circumstances may produce, at say twelve (months or years), 
an individual who has failed signally to fulfil the ‘promise’ 
he gave at one (month or year). The converse, ‘overfulfil- 
ment’, is far less common. Moreover, the stage at which the 
upper limits are set is unlikely to be at one particular moment 
of time; it probably stretches over a considerable period, 
whose length is not constant for all traits. 

Thus the stage at which the individual’s potentialities are 
determined varies, as does the time taken over determining 
them. There is variation, too, in the width of the limits laid
down; that is, the nature and the extent of the individual’s
reaction to later influences varies with the trait—and also
with the individual. 

An example of this variation, chosen from the latest stage
in Penrose’s classification, post-natal, would be reaction to
great mental strain, such as is sometimes imposed, for instance,
on officers in war-time. Major A finds the particular
strain to which he is subjected intolerable: he develops a
psychosis which takes him out of the army and into a mental
hospital, and which his psychiatrist believes he would probably
not have developed had he remained in his original job
as a bank clerk. Major B who found himself in almost
identical circumstances during the war remained sane; he
evidently would require some extraordinarily intense and
prolonged strain, perhaps of a different kind, before
becoming a psychiatric case.

An illustration of determination at the immediately
preceding stage in Penrose’s classification, the intra-natal
stage, might be the production of a feeble-minded child who 
sustained a head injury at birth. This example would how- 
ever lead us back towards the controversy we are trying to 
avoid, since it might well be argued that the injury was due 
primarily to a constitutional weakness of the baby dating 
back to an earlier stage. An uncontroversial example of 
determination as early and as complete as is ever found, 
would be sex-determination. 

However, it should now be clear that the differentiation 
between ‘stages’ is largely arbitrary. Penrose has elected to 
name six stages: he might have made out nearly as good a 
case for seven or for five. The point which he has brought out 
and which is of paramount importance is that we are con- 
cerned with a continuum. 

His classification makes clear the futility of adhering to a 
simple dichotomy when discussing the inheritance of intelli- 
gence, and the effects of environment on its development in 
the individual. Differences in time may be measured in big 
or in small units; differences effected at Time 1 influence, 
and to some extent determine, differences at Time 2. Thus 
to speak of ‘interaction’ between nature and nurture implies 
a separation which does not exist, and which is impossible 
to maintain either in theory or in practice. If for convenience
the two terms be temporarily accepted and interaction 
between nature and nurture be postulated, we find ourselves 
committed to the assumption that the interaction is uniform. 
This is patently untrue: individuals vary enormously in their 
reactions to obvious environmental stimuli. Uniform inter- 
action leads naturally to the concept of quantifying the pro- 
portions contributed by both, a mistake which we have 
already considered. 

Finally, a simple distinction of kind implies that the trait 
under consideration is simple: this is clearly as untrue 
of intelligence as it is of most other characteristics that
particularly interest the psychologist. Intelligence is not
analogous to height or to eye-colour; it differs in at least 
three relevant respects. First, its connotation is complex 
and controversial, even when we restrict ourselves to con- 
sidering the fully developed intelligence of the adult. 
Secondly, it is not always easy to ascertain what it is that is 
genetically controlled even in instances when the final out- 
come is relatively clear: the characteristic manifested in the 
individual is sometimes found to be very different from that 
which is genetically determined. For example, the gene 
producing phenylpyruvic amentia plays a specific and local- 
ized part in the metabolism of phenylalanine: the disease 
manifests itself, however, as a form of low grade mental 
deficiency, not as an obvious biochemical abnormality. 
Thirdly, the ‘normal distribution of intelligence’, on which 
so much has been based, does not and cannot cover all that 
is known of intelligence and its inheritance. High grade 
mental defectives and borderline defectives take their place 
in orderly fashion towards the extreme of the normal curve, 
followed by the imbeciles and idiots. But the latter—the 
severe mental defectives—are for the most part genetically
distinct from the rest: idiots are not fertile whereas the 
feeble-minded are notoriously prolific. 

At the other end of the scale it may be interesting (if not 
altogether relevant) to ask whether those acclaimed by 
posterity as men of genius would in fact have gained I.Q.s
of 150-200 on standard intelligence tests. There is some 
evidence of the continuity of intelligence and of a genetical 
contribution (whose proportion is unknown): the mistake has
been to argue from these facts that intelligence is a single,
simple quality and that the amount possessed by any one
individual can be adequately expressed in terms of his
position on the normal curve.

The least controversial data on the influence exerted by
environment on intelligence are based on investigations
carried out with pairs of monozygotic twins.¹ Most other
inquiries are open to the objections in reasoning outlined 
a few pages back. However, even identical twins are not 
foolproof since their environments are likely to be more 
similar, in important psychological respects, than are those 
of ordinary siblings and of dizygotic twins dissimilar in 
appearance. Moreover, much of the early work on twins 
suffered from the difficulty of determining with virtual 
certainty which pairs of twins were in fact identical. 

The evidence on the intelligence of twins may be sum- 
marized as follows: 

(a) The closest agreement has certainly been found 
between monozygotic twins brought up in the same 
environment. Even here, however, the correlations do not 
reach unity, but are of the order of 0.8 to 0.9. This suggests
that some non-genetic factors may influence the fairly close 
association found between ordinary siblings, on performance 
in intelligence tests and in non-laboratory situations. 

(b) When young monozygotic twins are separated, their 
intelligence test scores are not nearly as close as those of 
monozygotic twins brought up together. In fact the differ- 
ences between them are as great as those between un- 
separated dizygotic twins. 

(c) The differences in intelligence found among separated 
monozygotic twins tend to go, naturally enough, in the 
direction of environmental differences, assessed by such 
criteria as foster-parents’ occupational and economic status. 
It should be borne in mind, however, that adoption tends 
to be selective. 

(d) The later in life the separation occurs, the more similar
is the intelligence of the twins, however assessed.²

It would appear then that the case for the puissance of 


¹ (a) Newman, H. H., Freeman, F. N. and Holzinger, K. J., Twins: A
Study of Heredity and Environment (1937); (b) [...], ‘The
Nature versus Nurture Problem’, [...] educ. Psychol., [...].

² Hogben, L., Nature and Nurture (1933).
heredity can be overstated. The tendency of psychometrists
has been to overstate either one case or the other and, in 
their keenness to stress the immutability of intelligence test 
results and the desirability of using mental tests predictively 
in many spheres, they have more often over-emphasized the 
genetical than the environmental influence. It is common to 
hear of ‘culture-free’ tests; to read of investigations which 
result in ranking different nationalities, or colours, or races, 
or classes, in order of degree of intelligence; to find a rigid 
distinction drawn between, for instance, ‘information’ 
vocabulary tests and ‘innate capacity’ diagrammatic tests. 
All these practices presuppose that the nature/nurture 
dichotomy is valid in general, and that it applies to ‘intelligence’
as a particularly felicitous example of ‘nature’; that
this isolated and innate ‘intelligence’ can be measured without
reference to the background in which the individual has
grown up; and that the measurer is in a position to dissociate
himself from his own background and to devise tests with
universal application. The fact that the nationality, colour
and race of the tester tends to head the list in such inquiries, 
is evidently taken as confirmation of the rightness of things. 

It is clearly true that certain tests depend more on general 
education than others and that some are more closely related 
to the relevant cultural pattern than others. But it is true 
also that the extent of this dependence is not always easy to 
gauge and, therefore, that conclusions concerning either 
individuals or groups from different classes or nations or 
cultures should be drawn with the utmost caution. Were it 
possible to devise any tests which were genuinely ‘culture-free’, 
their value would probably be exceedingly limited and 
their results would bear little relation to that which is usually 
considered intelligent behaviour. 


3 Warburton, F. W., ‘The Ability of the Gurkha Recruit’ (1951), Brit. 
J. Psychol., XLII, 1 and 2. In this paper, an excellent account is given 
of the difficulties of administering conventional tests to primitive peoples 
and of the dangers of over-simple interpretation of test results. 


Chapter Eleven 


FLEXIBILITY VERSUS RIGIDITY 


Many of my criticisms of intelligence tests as currently used 
could be summarized by saying that they are too rigid and 
inflexible both in their requirements of the test-subject and 
in their interpretation. This is especially true of group 
intelligence tests, largely because of the attitude of the 
psychometrist. He argues that in difficult cases individual 
tests should be given and that individual testing is far more 
skilled and requires a good deal more training than group 
testing. In fact, he often behaves as though the purpose of 
group testing is to obtain data suitable for statistical analysis 
and that he is thus justified in ignoring the finer points of 
testing technique, when using group tests. 

Under ‘testing technique’ I should like to include motiva- 
tion. Generalizing on this topic is dangerous since children 
and adults, manual and clerical workers, the sick and the 
healthy, are likely to respond very differently to the test 
situation as such, no less than are the individuals comprising 
these groups. However, I may perhaps generalize to the 
extent that the subject should have a reasonably strong, but 
not too strong, incentive to do well in the test. His attitude 
should resemble the African houseboy’s attitude to his 
boss’s bottle of whisky: ‘Just right. If it had been any 
better, he wouldn’t have given it to me; if it had been any 
worse, I couldn’t have drunk it. It was just right.’ 

If the subject feels that everything hangs on his test per- 
formance—that he will be letting down himself, his family 
and his future if he fails—then (even if this picture be 
unwarranted) he will very likely do himself less than justice. 
On the other hand, if he is bored in anticipation at the 
thought of the test and it fails to gain his interest when he 
undertakes it, he is equally unlikely to do himself justice. He 
may obtain a score similar to that of the over-anxious, 
over-motivated subject; he may or may not be similarly gifted; 
he will certainly gain that score for quite dissimilar reasons. 
It is not possible entirely to control the attitude of the 
members of a group taking a test, but the psychologist can 
contribute a good deal to his subject’s enjoyment—if he 
considers it worth while bearing this in mind when devising 
his tests, choosing them for the group and administering 
them. It follows from this, and from what was said earlier 
about ‘observer error’ and other sources of unreliability, that 
some element of continuity is desirable from the designing of 
a psychological test to the administering and scoring of it. 
Unfortunately such continuity is becoming increasingly 
rare, specialization having grown as prevalent in psycho- 
metrics as in medicine, with the same tendency to ignore the 
existence of the individual subject or patient as an organic 
and psychological whole. 

Assuming the test to be already standardized and the group 
selected, the tester has a considerable part to play in enlisting 
the subjects’ co-operation and interest during the actual 
testing. The competent, sympathetic tester will allow more 
time than is strictly demanded by the particular test he is 
using. He will take time at the beginning, far from wastefully, 
to enable the subjects to relax and to get to know him 
a little. He will do this partly by explaining the point of the 
testing to his subjects: this serves the double purpose of 
gaining their co-operation and, often, of calming some of the 
fears of the uninitiated. It is desirable, too, that he take time 
at the end to enable them to express their views and feelings 
about the test and to discuss it, however briefly, with him. 
Most important, however, is the need for the 
subjects to complete preliminary examples of the 
test problems. They must have time to do so at leisure, 
correctly, understandingly, and therefore reassuringly. 
The subject who is slow to accept a new principle is 
not necessarily slow to solve problems in accordance with 
the principle, once he has grasped it; and the one who is so 
tense that he is unable to think or to hear clearly, is often 
able to work through to relaxing and understanding, by way 
of tackling introductory, untimed examples. However, 
these are not only for the sake of the slow starter and the 
nervous subject: examples help considerably to overcome 
the effects of differential test-familiarity among members of 
the group.1 Furthermore, they afford the critical and the 
non-suggestible, who may not immediately accept the parti- 
cular conventions of the test, the opportunity to state their 
objections and to learn, if they will, just what those conven- 
tions are. 

If the examples are to fulfil their functions, they must be 
of roughly the same standard as the questions in the test 
proper and the subject must complete them with a minimum 
of aid. It is sometimes claimed that preliminary examples 
do little towards equating subjects in respect to differential 
test-experience.* This is true when the examples are so easy 
that they provide no challenge to the subject and they 
inculcate a misleadingly low level of aspiration; or when too 
few examples are given; or when the major part of the work 
of completion is performed for the weaker subjects by the 
tester—if indeed the solutions are not already printed 
along with the questions. 

In such cases, preliminary examples are not very helpful 
and it is likely that the validity as well as the consistency of 
the test may suffer. But if the devising and presenting of 
examples be treated as an essential of the test, a part- 
determinant of set, of motivation and of understanding, the 
time spent on them will prove rewarding. 


a Wallace, J. G., ‘Results of a Test of High Grade Intelligence Applied 
to a University Population’ (1952), Brit. J. Psychol., XLIII, 1. 
The material conditions of satisfactory group testing are 
usually less neglected than are the intangibles, discussed 
above. Lighting, heating and seating are generally adequate 
in test-rooms, though the acoustic properties of the bigger 
rooms vary considerably. It is not always realized that 
people at the back of a large hall may fail to hear some of the 
instructions nor that the form and order of the instructions 
have a marked effect on the subjects and their responses. 

The instructions for a test should be uniform but the 
tester should not sound automatic when giving them for the 
twentieth or two hundredth time. They should be as brief 
as possible but should include all that the subjects want to 
know and may know, such as the time limit, when to turn 
over, how to record their answers. They should be given at 
a speed which does not flurry the slowest, and does not 
infuriate the quickest member of the group. 

To revert to the first paragraph of this chapter: perhaps 
the reasons are now emerging for my unwillingness to agree 
wholeheartedly with the tenet that little or no attention 
need be paid to technique in group testing whereas a great 
deal is demanded in individual testing. The latter presents 
none of the difficulties of effecting a compromise between 
the needs of the quick bright subject and the slow dull 
subject; none of the dangers of ignoring an insignificant- 
looking silent subject, desperately needing help; in brief, 
none of the problems of establishing rapport with the mem- 
bers of the group—as individuals, in addition to as a group. 

The individual tester, unlike many group testers, knows 
that his testing will be of little value if he fails to establish 
rapport, but he has been (rightly) taught to regard this 
process as part of the test situation. He, too, will encounter 
difficulties but they will be different ones. The individual 
tester must observe the subject’s behaviour all through the 
test but must at no point inflict on the subject any feeling of 
self-consciousness or embarrassment. The tester must time 
and record the subject’s responses, cope with the apparatus 
if any and make an occasional note, while all the time 
keeping up the appearance of a game (with a child) or a 
rather unusual conversation (with an adult). If, on the other 
hand, he inadvertently establishes too close a rapport, the 
individual tester is likely to find himself in the position of 
father confessor and psychiatrist in chief, being regaled with 
intimate biographical details and consulted on topics 
ranging from the subject’s test performance to his proneness 
to lose friends and alienate people. 

However, such confidences are irrelevant only if the 
tester regards himself as a specialist whose job is simply to 
ascertain the subject’s intelligence, and who regards intelli- 
gence as some single quality, measurable in vacuo, and un- 
affected by the subject’s emotional disposition or need. 
Whilst I should agree that, ideally, individual tests should 
be given, especially when the outcome of the testing is of 
great importance to the subject, I should argue that group 
testing ought to be regarded as an equally skilled job and 
that group testers should be trained as carefully as are the 
more ‘clinical’ individual testers. 

Earlier in this book (particularly in Chapters VII and 
VIII) I have contrasted the cognitive with the orectic. 
This distinction is highly artificial and can only be defended 
—if at all—on grounds of brevity and of convenience. In 
practice it is well known that these two attributes are merely 
aspects of personality—and intimately related aspects. Try, 
for instance, to find a group of mentally defective children 
for an experiment: the group will almost certainly be found 
to include a big proportion of children with behavioural 
defects of one kind and another. It is virtually impossible to 
find a group of ‘pure’ mental defectives, that is, one whose 
members are ‘normal’ in all respects save I.Q. Conversely, 
many individuals with low I.Q. and high ‘stability’ would, 
rightly, not be classified as mentally defective at all, whether 
the relevant definition be a legal or a social one. Or consider 
the effect of emotional tension on people who usually behave 
intelligently and deftly: one loses his ability to concentrate 
and ceases to hear what is said to him; another, who appears 
otherwise unaffected, suddenly becomes clumsy and finds 
himself dropping things or knocking them over. 

To say that our emotional and our intellectual experience 
and behaviour are intimately related is a truism. To stress 
this fact when discussing mental tests, however, may not be 
platitudinous, nor need it detract from the undoubted 
value of such tests. On the contrary, its explicit recognition 
by psychometrists should in the long run establish their tests 
on a surer footing than has been achieved by making 
extravagant claims for the tests. 

It is not fortuitous that the two greatest obstacles to 
appraising intelligence with accuracy are its inherent com- 
plexity and variability and its lack of adequate criterion. In 
the weakness of the various criteria lies something of their 
strength: were they more satisfactory qua measures they 
would perhaps convey an unwarranted sense of simplicity 
and consistency. We cannot conscientiously rank occupations 
or assessments based on interview, for example, 
one-dimensionally from highest to lowest; and the difficulty is 
not resolved by allowing a great many ‘ties’ in the ranking. 
It is the difficulty of including red-heads in a 
golden-to-brown hair colour scale, of comparing salmon with grapes 
in a preference scale, of arranging varied problems in 
order on a difficulty scale. Certainly such tasks are 
attempted—notably by psychologists—but only a highly 
sophisticated or exceedingly naive judge will be happy 
making such judgments or contemplating the results. 

With the exception of chronological age (which should be 
applied to children only) and of previously standardized 
tests, the criteria used for validating intelligence tests are not 
rigid and inflexible. Even factor analysis does not yield a 
unique solution, since the outcome depends first on the 
particular statistical method chosen and secondly on an 
arbitrary psychological interpretation of the statistical 
findings. However, this ‘flexibility’ is of a different kind from 
that which I have been considering. 

The type of criterion which is the vaguest and the 
furthest from masquerading as objective is the personal 
judgment. At the extreme end of this type is to be found 
the interview assessment, a judgment of qualities which 
may or may not exist, many of which vary with the inter- 
viewer and all of which are based on inadequate data. The 
interviewer’s greatest problem is to get the subject to speak 
and behave freely and revealingly, without himself giving 
any lead as to what line of talk and what behaviour would 
be most acceptable. The interviewer must acquire the 
maximum of information but should refrain from asking 
leading questions; he must induce a feeling of confidence on 
the part of the subject and a certain feeling of equality, 
while not being too forthcoming about his own views and 
tastes; he must be interested but not effusive, unsuggestive 
but not wooden. He should never behave unresponsively 
yet, equally, he should not express a decisive judgment— 
whether laudatory or condemnatory—on anything the 
subject may say or do. The interviewer should withal retain 
the appearance and, if possible, the reality of some 
spontaneity. 

In fact, his task is very similar to that of the individual 
tester. There is little point in adopting the old-fashioned 
system of intimidating an interviewee and ‘seeing how he 
stands up to it’, because if he stands up to it badly the inter- 
viewer will not know whether to explain the failure in terms 
of the subject’s reaction to the interview or of his intrinsic 
personality. On the other hand, a subject who takes it well, 
may merely be insensitive. In the same way, an unsympathetic 
individual tester will only discover how poorly his 
subject can score on a given test battery; and this evaluation 
will in fact be far from accurate since people vary enormously 
in their susceptibility to testing conditions. Even in the most 
satisfactory circumstances the tester can usually do no more 
than estimate the lower limit of his subject’s capacity: ‘he is 
at least as good as this’ should be the form of the appraisal. 

The aim should be a positive one: to ascertain what the 
subject can do, and what he does best, rather than to demonstrate 
his incapacity for this or that. The grounds for this 
suggestion are not charitable or ethical but rather scientific 
and practical. If the subject has gained such and such a 
score, he clearly has the ability to do so, but—for the sort of 
reason discussed under ‘unreliability’—he may have the 
ability in certain circumstances to do better. The only sub- 
ject who is likely to do himself more than justice is the one 
who is familiar with identical or similar tests. 

What is meant by ‘doing himself more (or less) than justice’? 
The phrase implies that there is a certain level of intellectual 
ability which the subject is thought to possess, on some 
evidence other than his present test score. It might be 
school record or occupational status or judgments of friends 
and colleagues; in fact, it might be any or all of the criteria 
against which tests are validated in the first place. ‘But,’ it 
may be argued, ‘these criteria take into account a lot more 
than pure intelligence: they are all influenced more or less 
obviously by the subject’s general personality, by his tem- 
perament and character. Intelligence tests are less confused; 
they do not estimate, and make no attempt to estimate, any- 
thing except the subject’s intelligence. Irrelevancies such 
as his persistence, his sociability and his sense of humour are 
cut out—unless, of course, some supplementary tests of 
personality are administered.’ 

This is what I meant when I suggested that the weakness 
of the criteria for test validation was also their strength. 
Qualities of a non-intellectual kind play an important part 
in all criteria of a biographical nature, and it is the latter 
which have significance outside the psychological laboratory. 
If we revert to Thorndike’s ‘first approximation’, we find 
that he defined intellect ‘as that quality of mind (or brain or 
behaviour if one prefers) in respect to which Aristotle, Plato, 
Thucydides and the like, differed most from Athenian idiots 
of their day, or in respect to which lawyers, physicians, 
scientists, scholars and editors of reputed greatest ability at 
constant age, say a dozen of each, differ most from idiots of 
that age in our asylums’. 

A differentiation as broad as this would probably not 
be rejected or even criticized by contemporary psycho- 
metrists—save perhaps on grounds of crudeness. Yet the 
differences between Thorndike’s groups would undoubtedly 
be found to embrace differences in temperament and 
character no less than differences in intellect. It is true that 
he stipulates the quality of mind which shows most difference 
between his three groups but there can be little doubt that 
his eminent lawyers and editors would differ from the certi- 
fied idiots almost or quite as much in emotional disposition 
and interests, as in intellectual capacity. 


Chapter Twelve 


SUGGESTIONS FOR FURTHER WORK 


Current research in mental testing is confined almost 
exclusively to factorial work. As I have suggested, this often 
involves internal criteria for validation and inadequate care 
in designing and administering tests. Moreover, most of the 
techniques as commonly used imply certain unacceptable 
corollaries, notably the existence ‘in the mind’ of a number 
of separate ‘factors’ or faculties and, at the stage of inter- 
pretation, a god-like insight as to what these are. To the 
majority of psychologists, further research in mental testing 
denotes the devising of new tests in accordance with factorial 
principles, the refinement of highly specialized statistical 
techniques and, perhaps, the pursuance of animated con- 
troversy over such existing techniques. 

I should like to suggest a few research projects along 
non-factorist lines, which seem to me worth while. Such 
studies would at least help to crystallize certain factual 
problems; at best, they would provide fresh psychological 
material. 

One of the central concepts in mental testing is, or should 
be, the concept of difficulty. In psychometric literature, the 
interest and the ambiguity attaching to this term have been 
minimized: the ‘difficulty’ of a problem is determined simply 
and objectively by calculating the proportion of subjects who 
give the right answer to the problem. Degrees of rightness 
are automatically eliminated by the form of presentation of 
the question and answer and, similarly, all wrong answers 
are equated. With rare exceptions, the time taken by 
individual subjects over individual questions is ignored. The 
few experiments which have been conducted on this topic 
have been unsatisfactory because the onus of time-recording 
has been, more or less directly, on the subject1—thus 
substantially altering the total test situation. 

It would be interesting to make some investigations on 
intelligence tests in which the record of the individual’s 
answer to each problem was supplemented by a record of the 
time he took to solve it (or to give it up), but without his 
bearing the burden of responsibility for the accuracy of the 
time-record. The form, the medium, the subject-matter and 
the intellectual level of the test would be varied, as would the 
relationship between the level of the test and the level of the 
subjects. 

In this way, data could be gained on the various criteria of 
‘difficulty’ and the extent of their associations, positive or 
negative, with one another. The intelligence of a subject is 
sometimes estimated in terms of the complexity of the prob- 
lems he solves (ignoring the possibility that what is in fact 
complex for any given individual is not always predictable) ; 
sometimes in terms of the speed with which the subject 
solves given problems; and sometimes in terms of a combina- 
tion of these. If A and B both solve problem P correctly but 
A solves it faster, he is usually considered more intelligent 
than B. It should follow then that the difficulty of a parti- 
cular problem may be estimated in terms of correctness of 
response or of time taken or of a combination of these. 
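
The two estimates of difficulty described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the names `difficulty_by_error` and `difficulty_by_time`, and the figures themselves are invented for the purpose and come from no actual test.

```python
from statistics import mean

# Hypothetical records, one per attempt: (subject, problem, correct, seconds).
records = [
    ("A", "P1", True, 20.0),
    ("B", "P1", True, 35.0),
    ("C", "P1", False, 50.0),
    ("A", "P2", True, 60.0),
    ("B", "P2", False, 15.0),
    ("C", "P2", False, 10.0),
]

def difficulty_by_error(records):
    """Proportion of subjects answering each problem wrongly."""
    by_problem = {}
    for _, problem, correct, _ in records:
        by_problem.setdefault(problem, []).append(0 if correct else 1)
    return {p: mean(v) for p, v in by_problem.items()}

def difficulty_by_time(records):
    """Mean time in seconds spent on each problem, right or wrong."""
    by_problem = {}
    for _, problem, _, seconds in records:
        by_problem.setdefault(problem, []).append(seconds)
    return {p: mean(v) for p, v in by_problem.items()}

error_index = difficulty_by_error(records)
time_index = difficulty_by_time(records)

# The 'hardest' problem under each criterion need not be the same one.
hardest_by_error = max(error_index, key=error_index.get)
hardest_by_time = max(time_index, key=time_index.get)
print(hardest_by_error, hardest_by_time)
```

In these invented figures the problem with the higher proportion of errors is not the one on which most time is spent, which is the kind of disagreement between criteria that the proposed experiments would be designed to detect.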

These two criteria of difficulty may well not agree. There 
is no reason to suppose that the problems which yield the 
highest proportion of errors will also be those that demand 
the longest time for their solution. Cane and Horn, working 
with problems of spatial perception, found no simple 
relationship of this kind.2 Their results suggested that there 
may be a difference between the time taken by the members 
of the group who answer a given problem correctly and the 
time taken (by other members of the group) to answer the 
same problem incorrectly. In this event, the mean time spent 
by the total group on that problem may well appear 
unrelated to ‘difficulty’, as assessed merely by the proportion 
of subjects correctly answering the problem. 

1 Slater, P., ‘Speed of Work in Intelligence Tests’ (1938), Brit. J. 
Psychol. 

2 Cane, V. R. and Horn, V., ‘The Timing of Responses to Spatial 
Perception Questions’ (1951), Quart. J. exp. Psychol., III, 3. 
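
A minimal sketch of the comparison suggested by Cane and Horn's results follows: the times of those who answer a problem correctly are kept apart from the times of those who answer it wrongly. The function name `times_by_outcome`, the record layout and the data are assumptions made for illustration; they are not taken from the paper cited.

```python
from statistics import mean

# Hypothetical records, one per attempt: (subject, problem, correct, seconds).
records = [
    ("A", "P", True, 40.0),
    ("B", "P", True, 50.0),
    ("C", "P", False, 10.0),
    ("D", "P", False, 12.0),
]

def times_by_outcome(records):
    """Return {problem: (mean time when correct, mean time when wrong)}.

    A side with no answers at all is reported as None.
    """
    split = {}
    for _, problem, correct, seconds in records:
        right, wrong = split.setdefault(problem, ([], []))
        (right if correct else wrong).append(seconds)
    return {
        p: (mean(r) if r else None, mean(w) if w else None)
        for p, (r, w) in split.items()
    }

outcome_times = times_by_outcome(records)
overall_mean = mean(seconds for _, _, _, seconds in records)
print(outcome_times, overall_mean)
```

With these invented figures the overall mean time lies well away from either sub-group's mean, so a single mean per problem would conceal the difference between quick wrong answers and slow right ones.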

There might be a further way in which a record of time 
spent on question per subject would prove valuable; that is, 
as an index of variability or flexibility. It is possible that the 
more intelligent subjects will vary in respect to the length of 
time they are prepared to spend on different tasks. This is 
clearly so in non-laboratory situations. The intelligent, more 
or less wittingly, devote time to solving a problem in accordance 
with the degree of their interest, the complexity of the 
problem and the gain they are likely to have from achieving 
a solution. The less intelligent tend to be more rigid in this 
as in other situations. They differentiate less and they are 
perhaps more liable to adopt an all-or-none attitude: ‘No, 
it’s no good—I can never do this sort of thing’ or, alternatively, 
‘I don’t care how long it takes me—I’m going to get 
it.’ Some people tend to adopt the one or other of these 
attitudes in many unlike situations. 

This distinction (in terms of time taken) between the 
rigid and the more adaptable approach to problems may or 
may not be useful. The experiments described above would 
yield the data necessary for verifying the hypothesis as 
applied to mental tests. The record of times taken by each 
individual over each question would allow for a reasonably 
objective estimate of flexibility to be made and this estimate 
could then be compared with the available external criteria 
and with the subject’s total score on the test. 
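
One way in which such an estimate of flexibility might be computed is sketched below. The text does not specify a statistic; the coefficient of variation of each subject's times is an assumption chosen here for simplicity, and the function name `flexibility_index` and the data are hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical times in seconds, one list per subject, one entry per question.
times = {
    "rigid": [30.0, 30.0, 30.0],
    "flexible": [10.0, 30.0, 80.0],
}

def flexibility_index(seconds):
    """Coefficient of variation of one subject's times across questions.

    Zero means the subject spent the same time on every question;
    larger values mean the time varied more from question to question.
    """
    average = mean(seconds)
    if average == 0:
        return 0.0
    return pstdev(seconds) / average

flexibility = {subject: flexibility_index(s) for subject, s in times.items()}
print(flexibility)
```

An index of this kind could then be set beside each subject's total score and the external criteria. Since some questions simply take longer than others, a fuller treatment would probably express each time relative to the group's usual time for that question before measuring the spread.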

The notion of equating greater and less intelligence with 
greater and less flexibility is of course an old one. But this 
particular interpretation of flexibility is, I think, new; and 
these means of assessing it are less artificial than some of the 
more direct methods which have been tried. 
There is a further advantage to be gained from recording 
individual times in addition to answers: the practicability of 
producing genuinely parallel tests. Most so-called parallel 
tests differ slightly in their norms and, to the best of my 
knowledge, none of their devisers has taken time into account, 
apart from imposing the customary overall time limit. The 
production of tests which are parallel—question for question 
on the two-fold criterion of proportion answered correctly 
and time taken—would enable some long overdue experi- 
ments on transfer of training to be initiated. 

The literature on transfer of training is voluminous. The 
work includes accounts of experiments on transfer effects in 
motor, perceptual and cognitive tasks, and a great deal of 
useful and well-documented material has been collected on, 
for instance, knowledge of results and some of the other 
underlying principles of learning. The data in the cognitive 
field are conflicting, however, and they are probably 
nowhere more conflicting than on the topic of intelligence 
tests. I should like to suggest inquiries in which the 
conditions for maximum and minimum transfer, and also 
for ‘negative transfer’, were investigated. It might be 
hypothesized, for example, that the greatest transfer would 
be effected between tests with the greatest similarity. This 
immediately raises the question as to what constitutes 
psychological similarity. Are tests P and Q—respectively 
verbal analogies and verbal classification—more, or less, 
similar than tests P and R—respectively verbal analogies 
and diagrammatic analogies—for example? Judging from 
Vernon’s articles,1(b) he would consider the latter more 
similar. But it is surely a matter which merits experimenta- 
tion rather than speculation. 

It might further be hypothesized that negative transfer is 
most likely to occur, for instance, from an easy to a difficult 
test or from an untimed to a timed test, or from a test of one 
medium and one set of instructions to another of the same 
medium but different instructions, necessitating a change of 
‘set’. These hypotheses again have some theoretical as well 
as practical interest and would not be difficult to test. 

1 (a) ‘Intelligence Testing’ (1952), Articles and letters reprinted from 
the Times educ. Supp. See, e.g. Vernon, pp. 3-9, and Wiseman, p. 27, on 
respective effects of practice and coaching; (b) See p. 7, same pamphlet. 

Perhaps the most important of the problems of transfer 
in mental tests are those concerned with generalized transfer, 
that is, the possibility of transfer between widely differing 
tests and between test and non-test situations. Opinion here 
is extremely conflicting: it ranges from a striking transfer 
effect found between very unlike tests1 to a denial of any 
transfer worth noticing, save that between ‘exactly parallel 
tests’. On the other hand, this denial was coupled in one 
instance with a suggestion that ‘skilful use of intelligence 
test material’ might ‘improve the ability of children or 
adults to use their brains effectively’.2 There is evidently 
scope for a good deal of research here. 

Further work which could well be incorporated with the 
above suggestions would include experiments on knowledge 
of results, with and without explanation. This might help to 
reconcile some of the conflicting views on the relative merits 
of coaching and practice, and on the importance or 
unimportance of the coacher’s personality and methods. All 
these inquiries should be conducted bearing in mind the 
problem of the relation between group and test, and of 
motivation. Mandler and Sarason have shown, for instance, 
how stress may affect subjects differentially.3 The conflicting 
results found by investigators in this field are due largely to 
their oversimplification of the issues and, hence, of their 
interpretations of the data. 

1 Heim, A. W. and Wallace, J. G., ‘The Effects of Repeatedly 
Retesting the Same Group on the Same Intelligence Test: II. High 
Grade Mental Defectives’ (1950), Quart. J. exp. Psychol., II, 1. 

2 ‘Intelligence Testing’ (1952), op. cit., p. 9. 

3 Mandler, G. and Sarason, S. B., ‘A Study of Anxiety and Learning’ 
(1952), J. abnorm. soc. Psychol., XLVII, 2. 

Another suggestive finding of Cane and Horn, in the same 
paper, was that the formulation of the question—multiple 
choice or creative answer type—proved to have some bear- 
ing on its ‘difficulty’, whatever the criterion adopted. This 
indicates that research specifically on the presentation of 
problems in intelligence testing might be fruitful. In 
contemporary group tests of intelligence, true-false and 
multiple choice types of question are used almost exclusively, 
owing to their advantages of objectivity and speed of mark- 
ing. It would be possible, and it might be very rewarding, 
to conduct experiments in which the same problems were 
presented in these two forms and also in a more 
‘extrapolative’ form,1 to different groups. 

These studies might well be extended to include experi- 
menting with the types of solution offered in multiple choice 
questions. It would be interesting, for instance, to vary the 
degree of relevance of the various incorrect solutions to a 
given problem and the extent of the differences obtaining 
between the solutions. The same (multiple choice type) 
question might be presented in an easy, a harder and a very 
difficult form, by altering the number of, and the relations 
between, the given solutions. Thus ‘difficulty’ might be 
shown to inhere in the presentation of the answer no less 
than in the question itself. 

1 Bartlett, F. C., The Mind at Work and Play (1951), pp. 124-39. 

It is evident from some of the foregoing illustrations that 
the concept of ‘difficulty’ is closely related to the concepts of 
‘speed’ and ‘power’. I have already suggested that the 
speed/power dichotomy is untenable (pp. 119-21) since a 
change in the test instructions, or in the test-group relationship, 
may completely alter the emphasis. The experiments 
outlined above, in which the time required for each problem 
by the subject is recorded in addition to his answer, might 
yield new and interesting data on the relation between rate 
of work and correctness of response. The results might do 
little more than to point the undesirability of hasty generalizing 
on the speed versus power question. They might also, 
however, underline the fact that subjects who reach the same 
conclusion do not necessarily do so via the same route. 
Certain diagrammatic problems, for instance, may be ‘seen’ 
intuitively by one subject—who cannot explain but knows, 
with swiftness and certainty, that solution A is the right 
answer. The same problems may have to be worked out 
laboriously by a neighbouring subject, who has to reason 
step by step. It might even be worth while persuading sub- 
jects of varying ability and experience to produce introspec- 
tive, or retrospective, reports as they make their way through 
a formal test. 

Most of these suggestions for further research rely on the 
existing type of intelligence test, which consists of a series of 
problems primarily in deductive reasoning, usually of the 
multiple choice variety. The advantages of such tests as 
instruments of research are similar to their advantages in the 
fields of education and vocational selection: speed in adminis- 
tering and marking, objectivity in scoring, etc. We have 
seen that it is not always easy to determine what exactly they 
measure, even in ideal circumstances. But when the tests 
are devised with insufficient care, so that some questions are 
ambiguous, and two or more (or none) of the proffered 
solutions are defensible as ‘the correct answer’, then it is very 
likely that the test is measuring little more than the ability 
of the subject to identify himself with the deviser of the test. 

I offer two examples of this type of test question, from an 
exceedingly wide choice. Both are culled from a well 
established test, intended for subjects of high intelligence. 


[Figure: test item consisting of figures labelled (a) to (g); not reproduced.] 
A case may be made here for figure (e), on the grounds 
that it is a compressed version of (c), with resultingly greater 
curvature of its sides. This is one way of describing (b) in 
relation to (a). On the other hand, a case may be made for 
figure (f), on the grounds that it is a compressed and inverted 
version of figure (c) and that (b) may be similarly described 
in relation to (a). 

My second example is a verbal classification type of 
question. Classification questions are perhaps the most 
liable to invalidity; yet they are popular with devisers of 
tests, especially those concerned with high-grade subjects. 


Candle. Sun. Moon. Electric light. Gaslight. 


The task is to select ‘the word or phrase most unlike 
the others in meaning’. At first sight, Sun seems the 
obvious answer since it is the only primary source of energy. 
On second thoughts, the intelligent subject asks himself 
whether Moon is not as good or better, being the only 
source of reflected light. In desperation, he may finally 
plump for the Candle as the only one which melts as it gives 
out light. 

There is no need to multiply examples. Such slovenliness 
in the devising of test questions is frequently found and its 
frequency increases as the intended level of difficulty rises. 
Undoubtedly, such questions are more difficult than those 
based on cogent reasoning: they are in fact insoluble. But 
if any relation subsists between the score on these questions 
and the subjects’ intelligence, it is likely to be negative, 
since the duller subjects will be less liable to see and to worry 
over the difficulty and a fair proportion of these may be 
expected to choose the ‘right answer’ which the tester 
happened to have in mind. Brighter subjects are more 
critical and more aware of subtleties. They are therefore 
likely to penalize themselves by losing time on such questions 
and, finally, giving several answers or none. 

These are extreme instances. There are many questions, 
especially in high-grade tests, whose inadequacy is less 
glaring but which nevertheless perplex the intelligent sub- 
ject just because he is intelligent. It seems to me that a 
valuable contribution to future research on intelligence 
would be to persuade the designers of tests to spend more 
time and trouble on the early stages of test construction. It 
is strange that they are prepared to devote so much care to 
analysis of the test results—whether at the stage of item 
analysis, for instance, or of factor analysis—and so little on 
the questions which yield these results and on which all their 
subsequent work depends. This would not, strictly speaking, 
constitute new experiments, but it is surely a prerequisite to 
further studies along lines using the intelligence test as an 
instrument of research. 

I should like, however, to suggest further work on intelli- 
gence, not exclusively by means of formal tests. Most of the 
criticisms I have raised have one point in common: that the 
individual as a living person is being more and more 
neglected and that various attributes—in particular 
‘intelligence’—are being studied in vacuo, in some instances 
to be later related back to the individual, but in an arbitrary 
way. Binet did not have this tendency but almost all of his 
successors have had it and have wandered farther and farther 
away from Binet’s clinical approach. 

Qualitative studies on a few subjects might usefully replace 
the practice of quantitative studies on vast numbers. The 
latter vogue has had its uses but it has outstayed its welcome 
and now gives evidence of sterility. Perhaps there is already 
a move in a more qualitative direction. 

In the discussion on time spent per subject per question, 
the problem of temperamental differences among individuals 
was not raised, although it is clearly relevant. The obses- 
sional, for instance, is liable to take far longer over all 
questions than his fellow subjects, whatever his intellectual 
calibre may be. It has been shown also¹ that current 
emotional stress of any kind affects test performance (just as 
it affects every other aspect of behaviour), but not necessarily 
always in the same direction. Such matters are of 
importance to the theory of intelligence and the practice of 
intelligence testing. But their experimentation requires the 
acceptance of the subject as an integrated person: not as 
No. 318 among six hundred subjects nor as one of the data- 
producing entities from which g or several s’s may be 
extracted. 

At present, a dissociation obtains between the clinically- 
minded psychologists and the factorially-minded psycholo- 
gists which would, if it manifested itself in the behaviour of 
one individual, cause his friends and relatives the gravest 
concern. The factorists need the findings of the clinicians to 
guide them in their planning of experiments and their inter- 
pretations of results, if these are to have any value outside a 
small, esoteric, ‘operational’ circle. 

The necessity of studying intelligence as part of the total 
personality again obtrudes itself in any consideration of 
differences among the highly intelligent. Let us take five 
hypothetical scientists, all of whom gained first-class honours 
in their university examinations and have obtained I.Q.’s 
of about 160. ‘Objectively’, there is nothing to choose 
between them as to intelligence. Yet their colleagues are 
generally agreed that their quality of mind varies enormously 
and that they would consider very carefully which to consult 
about some intellectual or technical problem. 

A has an exceptionally clear mind. He tends to rephrase 
your queries in simple or symbolic terms. He fits your 
problem into a ready-made category and can therefore give 
you his answer very quickly. He is impatient if you suggest 
that the problem is less tidy than he suggests and he cannot 
easily spare the time to deal with people who ‘make 
difficulties’. He is a useful adviser when you need a clear-cut 
solution in a short time. 

¹ Honzik, M. P., MacFarlane, J. W., and Allen, L., ‘The Stability of 
Mental Test Performance between two and eighteen years’ (1948), 
J. exp. Educ., pp. 310-24. 

Yet B is more often consulted by his colleagues. B is less 
brisk, less clear and less definite. He spends more time 
listening than talking and perhaps his main virtue is the way 
he induces you to clarify your own ideas. He is more likely 
to ask further questions than to offer a solution. But you are 
just as liable to have solved your problem after a session with 
B as after a session with A. 

C probably has greater factual knowledge than either A 
or B. He is widely read and has a strikingly retentive 
memory; moreover he connects what he retains. He is 
respected as a particularly well-informed member of the 
department, but consulting him is apt to prove discouraging 
for he gives the impression of never listening and seldom 
hearing what you say to him and also of being keener to prove 
himself right than to solve the problem in the way most 
satisfactory to you. He will often provide you with references 
in addition to, or instead of, giving his own views. Talk with 
C rarely takes the form of a genuine discussion; and you 
constantly feel that you are failing to make contact with him 
although C himself appears unaware of this. 

D became interested in politics some years back and since 
then manages to relate all problems to his particular brand. 
This involves distortion in some instances and the inclusion 
of irrelevancies in others. On the whole, however, the 
relating is done skilfully and it sometimes gives rise to certain 
interesting hypotheses. Since D’s basic premises are formu- 
lated more explicitly than those of many scientists, it is 
possible to guess his answer to a good many problems; 
and even if you disagree in advance with this answer you 
often find a discussion with D stimulating and thought- 
provoking. 
E is the most irritating, the hardest to understand and, for 
some people, the most valuable consultant of the five. His 
response will depend (to a greater extent than others’) on his 
mood and no one is quite sure what determines his mood. He 
ranges from vague suggestions on apparently irrelevant 
topics to acute criticisms of your formulation of your own 
problem. When he makes suggestions, they are usually 
highly original and they appear to result from guesswork 
rather than rational thinking. He often seems to skate 
around the problem and it is only a day or two later that 
you perceive the significance of his remarks. He has the 
greatest standard deviation and the greatest unreliability of 
the five. Yet in general his advice wears better than that of 
anybody else. 

These five pen pictures may serve to illustrate several 
points. First, they emphasize the inadequacy of differentiat- 
ing between intelligent people, with respect to their mental 
powers, on a linear scale. On such a basis the five scientists 
have been shown to rank equally yet they evidently vary 

[…] because each subject would be presented with the same 
situation, or series of situations and, it is hoped, in due 
course certain norms would emerge. But no time limit would 
be imposed; the test could not be transformed into a group 
test without completely and obviously altering its structure; 
creative answers, as opposed to multiple choice, would be 
required; and the questions would not be concerned 
exclusively with deductive reasoning. 

The kind of problems presented would include social, 
practical, athletic, artistic and ethical problems: for example, 
what to pack, whether to buy on the grey market, how to 
express a criticism, where to spend a holiday, whether to 
return the tennis ball oneself or leave it to one’s partner, 
which film to see . . . the details for each hypothetical 
problem being very clearly defined. Thus the taste and the 
judgment of the subject would be appraised, rather than his 
logic or his capacity for divining the solution preselected by the 
tester. There would be neither right and wrong answers, nor 
varying degrees of rightness and wrongness. The solution 
would be as much a matter of personal choice as is the 
selection of a picture to hang in the sitting-room or of hors 
d’œuvre for lunch. On the other hand, such a test situation 
would differ from projective techniques such as TAT or 
Rorschach in that a specific set of problems would be 
presented, each demanding some sort of solution. The subject 
unwilling to choose a particular course of action 
would be encouraged to explain his difficulties rather than 
forced to make a decision. All that the subject says and does 
would be recorded, as would the time he takes to deal with 
the problems. 

In this way it might be possible to build up a general 
picture of the quality of the subject’s mind: his way of 
thinking (or intuiting, or assessing, or feeling), his natural 
tempo, his sense of values and of humour, the extent to 
which he does use deductive reasoning when this decision is 
left to him, his originality (comparing him with other 
subjects), his versatility (comparing his solutions one with 
another), the breadth of his interests. 

Another, somewhat more structured, test situation would 
be to present the subject with, say, three prose passages in 
which the same—unfamiliar—word appears, but whose 
contexts vary as much as possible. The word might be one 
invented by the tester or, better, a real but unusual word 
which is known to be lacking from the vocabulary of the 
subject. This could be ascertained in a suitably designed, 
preliminary test. He would be asked to suggest a meaning 
for the word, after reading or hearing respectively a first, a 
second and a third passage and, finally, studying all three 
passages together. If the passages were skilfully chosen, this 
again should give some measure of the intellectual level of the 
subject, mixed—as it inevitably must be—with his general 
outlook, his personality, his mood during testing, his interests 
and his motivation. 

A third suggestion is that the subject be presented with 
selected single pages from a number of books and papers— 
whose writing should be as varied in style and matter as 
possible—and asked to supply the word with which the 
subsequent page begins. In another experiment the subject 
might be asked to complete the whole sentence or paragraph. 
This kind of test, again, might yield information on the 
quality as well as the level of intelligence of the subject, on 
his adaptability, originality, interests, fluency and mood. It 
might also indicate those people who have a flair for saying 
or doing the exactly appropriate thing, often without know- 
ing why, even sometimes not realizing how peculiarly 
apposite is their response. 

It might be possible to present test situations such as I 
have outlined in other forms, for instance using auditory and 
pictorial and even tactual stimuli. In all cases, an attempt 
would be made to get the subject to ‘think aloud’: to 
introspect and to retrospect. The psychologist, even one 
who is concerned primarily to appraise intelligence, should 
be at least as interested in the process of studying problems 
as in the final solution (or lack of solution). 

Such experiments need not be confined to highly intelli- 
gent, otherwise ‘normal’ adults. They might well be 
extended to adolescents and children and to patients in 
mental hospitals. And the tests suggested should not of 
course be given in isolation. They should be combined with 
the other criteria which I have discussed elsewhere: bio- 
graphical, psychiatric and psychometric. But the aim 
would not be primarily to gain data on various attributes 
of the subject in order to compare him on similar attributes 
with his fellow subjects; rather, it would be to gain as vivid 
and complete a picture as possible of the subject in his own 
right. It seems to me that this should, at least for a time, be 
regarded as an end in itself. 

The dissociation on which I commented earlier is observ- 
able also with respect to motivation in psychology. The 
comparative psychologists have a long history of experi- 
ments designed specifically to examine the working and the 
strength of certain motives, drives, incentives. In this work, 
which has been largely but not exclusively confined to 
animal psychology, motivation has been considered a suit- 
able focus for experimentation. The psychometrist, however, 
has in general proceeded on the working hypothesis of 
identical motivation for all subjects presented with identical 
intelligence tests. Motivation is rarely, if ever, invoked, for 
example, as a source of variable error in discussion on the 
unreliability of tests. 

Human motivation is a baffling topic: the kinds of incen- 
tive and deterrent which the psychologist can provide tend 
to strike the subject as artificial or trivial; even if they are 
genuine and ego-involving, they will have unlike intensities 
for different subjects; and even if it were possible to achieve 
equal subjective intensity, this would probably have dif- 
ferential effects on performance. This may explain both the 
neglect of problems of motivation in most studies involving 
intelligence testing and the urge to increase test reliability 
and validity by purely statistical means. I believe, however, 
that work on the role of motivation in mental testing might 
be feasible, especially if the problems were tackled indirectly 
and via the individual. It might also be combined with the 
work on transfer of training outlined above (pp. 144-5): 
differential motivation might be introduced (in some of the 
test groups and some of the control groups), in addition to 
differential training and practice on a series of parallel tests. 

In such experiments, if the ‘highly motivated’ groups did 
show more improvement than the ‘poorly motivated’ groups, 
it would be legitimate to draw certain definite conclusions, 
at least about learning in tests. If, however, the groups 
showed no differences of any importance, an interpretation 
of ‘no genuine differences in motivation were achieved’ 
could compete with one of ‘test performance is relatively 
unaffected by differences in motivation’. 

These various suggestions for further work on intelligence 
seem to point along two routes. My earlier proposals con- 
cern experiments which would make use of existing intelli- 
gence test techniques: investigating new concepts of difficulty 
and of flexibility, experimenting with the number of, and 
degree of relevance of, the solutions in multiple choice 
problems and exerting unusual care in the initial construc- 
tion, especially of high grade tests. My later proposals are 
concerned with a new approach to the appraisal of intelli- 


Suggestions for Further Work 157 


My two approaches have substantially the same aim: the 
appraisal of the quality of mind of the subject. Most of my 
suggestions are intended as additions to, rather than sub- 
stitutes for, current methods. There is little doubt that 
current methods are invaluable when time is short and 
numbers are large; by means of them we can make the best 
of a bad job. I wish to stress, however, that it is not—or 
should not be—the only job with which the psychologist 
interested in appraising mental abilities should concern 
himself; nor are these means the only means of obtaining the 
required information. 




Chapter Thirteen 


WHERE ARE WE? 


I have criticized current trends in the testing of intelligence, 
especially in the methods of devising and validating the 
tests, and of interpreting and applying the results. My 
criticisms may have appeared predominantly destructive, 
save, perhaps, for Chapter XII in which I proposed lines 
of research which differ somewhat from the classic approach 
and which might or might not yield useful material—after 
a good deal of work. In any event, it is clear that most of my 
suggestions would have little value for large-scale testing 
and that the present methods are unlikely to sustain any 
radical change during the next few decades. In this last 
chapter, then, I should like to suggest ways in which existing 
techniques might be improved and, finally, to return to the 
concept of intelligent activity, as discussed in an earlier 
chapter. 

An intelligence test can be an exceedingly valuable 
instrument. When it is satisfactorily constructed, validated 
and administered, and its nature and level are appropriate 
for the subjects concerned, it is in general the best single 
means of estimating intelligence, in a short time. It provides 
a more accurate prediction than, say, an interview or a bio- 
graphical inventory (again singly) as to whether an adult is 
likely to succeed in a given job of known grade or whether 
a nine-year-old’s failure to master reading is mainly due to 
his low mental capacity. In fact the test will yield a useful 
appraisal of the subject’s capacity to grasp essentials and to 
respond appropriately to them, whatever the relevant 
situation may be. It is very impressive that such complex 
and important data can be obtained in an hour or less, when 
requisite conditions are fulfilled. 
If, however, the test is so ‘pure’ and so ‘reliable’ that suc- 
cess in doing it demands acceptance of a set of highly 
specialized (and sometimes questionable) conventions; if 
its ‘validity’ has been determined as some timeless, intrinsic 
quality, and established without reference to any external 
criterion; if the test is given by somebody with no interest 
in the subjects as individuals and little understanding of the 
difficulties the test may present to them; if the relation 
between type of test and type of group is ignored (for 
example, tests intended for random samples of population A 
being administered to highly selected, intelligent sub-groups 
of population A or pictorial problems being presented to 
subjects who are unused to pictures); if the test scores are 
recorded in an inapposite way, such as in terms of I.Q. for 
adults; if the results are regarded as fittingly expressed as a 
simple figure or ranking, correct to one or two decimal 
places, and this simple expression alone is used to determine 
the future of the subject who produced it: in the event of any 
one or more of these contingencies, the intelligence test will 
no longer be a valuable instrument and it may, in some 
circumstances, even be misleading. Its guise of scientific 
respectability renders it doubly dangerous and, the greater 
the emphasis laid on its objectivity and extreme precision, 
the more dangerous—because the more rigid—the sequelae. 
According to the extent of the technique’s misuse, it may just 
slightly mislead or it may do so very considerably. 

The first practical point to stress has been made by the 
majority of psychologists with any interest in testing, namely, 
that intelligence test score should never be used alone when 
making an individual assessment for practical purposes. 
Whether the particular problem be educational, vocational 
or clinical, intelligence test performance should always be 
used as an additional aid rather than a sole means; and this 
holds good even if the practical problem is primarily the 
appraisal of the subject’s intellectual capacity. However, I 
should maintain, with equal vehemence, that to attempt a 
psychological assessment of an individual, for whatever 
reason, without including at least one standardized test of 
intelligence would be as grave an error as to use such a test 
by itself. It would be very foolish to neglect any of the 
potentially useful means of assessment—and unforgivably 
foolish to neglect the most valuable of these. 

Thus it is clear that if we are concerned to assess someone’s 
intellectual calibre we should use many, if not all, of the 
relevant methods; and, bearing in mind our views on the 
indivisibility of mental life, we should not limit ourselves to 
methods of estimating ‘intelligence’. We should include a 
general case history, educational and occupational records, 
assessments made by individuals standing in various 
relationships to the subject, and tests of a more specific 
character than those termed general intelligence tests. This 
brings me to the second point I wish to stress: the importance 
of discrepancies. 

[…] than was to be expected on other grounds, he may have 
been unwell or upset at the time of the test; or perhaps his 
various ‘assessors’ were over-partial or over-anxious to do 
him well; or the subject may have had his own reasons for 
wishing to appear unintelligent. On the other hand, if he 
gains a surprisingly high score, it is possible that he is as men- 
tally alert as he appears and that the test situation, unlike ‘real 
life’ situations, has permitted him to demonstrate this— 
perhaps because the test is divorced from everyday problems 
which often demand knowledge, wisdom and originality, or 
perhaps because he finds himself temporarily free from 
emotional disturbances during the brief testing period. It is 
of course possible that the subject will have had a good deal 
of experience with psychological tests and that this will 
account for his impressive performance. These examples 
by no means cover all the reasons for conflict between 
intelligence test scores and other criteria. But they may 
indicate something of the interest which attaches to the 
discrepancies and the importance, diagnostically, of follow- 
ing them up. 

It may be objected that I have been considering intelli- 
gence testing only with a personal problem in mind: that 
psychometrists are often concerned with testing large groups 
for purely theoretical reasons—to establish norms for a new 
test or to correlate it with an old test, to ascertain its relations 
with age or family size or to determine its g saturation. This 
is true. In my opinion it is lamentably true since, sooner or 
later, the scores of individual subjects, tested initially with 
such ends in view, are likely to be taken seriously. Moreover, 
apart from the practical objections, such procedures may do 
a great deal of harm to theoretical psychology. 

Let us take, for instance, my last illustration, in which a test 
is given in order to ascertain its g saturation. The new test 
is included, say, in a battery of tests which is given to some 
thousand subjects. Its g saturation is found to be high. It is 
therefore accepted as a valid test of intelligence. The next step 
is to define intelligence in terms of performance on the test. 

In fact, a definition or description of intelligence based 
exclusively on the results of contemporary intelligence tests 
is not acceptable. To accept such a meaning is to infer that 
originality and creativity have nothing to do with intelli- 
gence; that the mediocre physicist or engineer will often 
prove ‘more intelligent’ than the eminent poet or historian; 
and that the average forty-year-old will tend to behave 
‘less intelligently’ than the average sixteen-year-old. More- 
over, it would be possible to learn to be more intelligent, 
with practice and training—a conclusion far more at vari- 
ance with the tenets of the psychometrist than with my own. 
If, however, intelligence be accepted as varying qualitatively 
with different individuals, and for the same individual on 
different occasions, and if, further, such concepts as social 
intelligence and practical intelligence be accepted; then the 
development and modification of intelligent behaviour even 
in adulthood is seemly. 

The word ‘intelligence’ is useful, and its meaning though 
vague is tolerably clear to the man in the street and was so 
to psychologists until the advent of psychometrics. It should 
retain a certain measure of vagueness because, as is often the 
case with psychological terms, what it refers to is not narrow 
and clear cut. With the attempts to banish vagueness, 
ambiguity (much more sinister) has been introduced: 
intelligence now has at least two quite distinct meanings, one 
highly specific and technical, the other general and popular. 
Sometimes one and sometimes the other is invoked in an 
endeavour to prove a point. 

I may now return to my original description of intelligent 
activity as consisting of the grasping of essentials and 
responding appropriately to them. The essentials may in one 
situation (such as an interview or a tea-party) be predomi- 
nantly of a social kind; in another (such as a desert island or 
a prison camp) they may be predominantly practical; in 
another (such as interpreting a painting or criticizing a 
symphony) they may be predominantly aesthetic; in another 
(such as a logic examination or a debating society) they may 
be predominantly of a deductive nature; in another, they 
may be unknown. 

In many non-laboratory situations the essentials are very 
mixed. The most intelligent sort of person is he who is able 
to respond the most appropriately in any given situation; 
and, especially, he who can grasp and respond to the essen- 
tials in situations which baffle others. It is true that deduc- 
tive reasoning, which constitutes the essence of most formal 
intelligence tests, is of major importance in many unlike 
situations, and that people who markedly lack this and yet 
excel in other fields are comparatively rare: hence the 
general agreement found between intelligence test results 
and other criteria. But neither this tendency nor the fact that 
such tests are far easier to devise and score than other 
varieties of mental tests, justifies the exclusion from ‘intelli- 
gence’ of the other types of essential. 

The question whether extremely appropriate individual 
behaviour in a limited field indicates greater or less intelli- 
gence than a lower degree of appropriateness in a wider 
field may provide entertaining, but purely academic, 
discussion. Its possibility underlines yet again the artifici- 
ality of a meaning which reduces differences in intelligence 
to quantitative differences only. 

The essentials vary, then, with the situation. The appro- 
priate response naturally varies with the essentials and may 
vary to some extent with the individual concerned. Speed 
is not altogether ignored since in some circumstances a slow 
response will, by virtue of its slowness, prove to be inappro- 
priate. In those situations where speed is unimportant, a 
slow response may be as appropriate (and, therefore, as 
intelligent) as a faster response. 
Thinking is a good example of intelligent activity but it is 
only one example chosen from among several. Thinking 
does not include, for instance, inspired guessing (though the 
former may precede the latter) or impulsive actions. Yet 
the person who is prone to inspired guesses or to pertinent 
impulsive actions would by common consent be described as 
intelligent. 

Psychologists often draw a distinction between thinking 
or insightful behaviour on the one hand, and trial and 
error behaviour on the other. In my view this distinction is 
invalid since it suggests that the two are opposed to each 
other. They seem to me to be different only in degree of 
overtness: all problem-solving is essentially trial and error 
activity but in so far as the entertaining and rejecting of lines 
of action are implicit and rapid, they tend to be unrecognized 
by the observer, and perhaps also by the unintrospecting 
subject—who is often able to retrospect them later. I should 
not even be justified in asserting that the more implicit the 
trial and error the greater the intelligence of the subject, for 
a more overt subject in a similar situation might succeed in 
reaching a neater or a more complete solution. 

On the other hand, it is probably legitimate to discrimi- 
nate between the subject who confines himself to consider- 
ing (implicitly or explicitly) only reasonably apposite 
solutions and […] order. The former has demonstrated, 
[…] his superiority in grasping 
essentials. The possibility of this sort of distinction is very 
[…] intelligence tests of the multiple 
[…] those whose proffered solutions to 
problems are all reasonably plausible. 

Intelligent activity is manifested in varying kind and 
degree by adult human beings, by children and infants, and 
by the lower animals. The notion of a linear scale from 
human geniuses downwards is unacceptable owing to the 
immense individual variation, the wide range found within 
the species and their consequent ‘overlapping’ one with 
another, and the way in which their particular gifts vary. 

An intelligent dog trained to lead the blind will behave 
more intelligently in certain situations than a normal four- 
year-old child or a mentally defective adult. He will, for 
example, roughly estimate his master’s height and so lead 
him as to avoid hitting his head against an outjutting 
obstruction, which the dog alone could pass under easily; he 
will also ‘generalize’ to the extent of conducting his master 
to the nearest grocer or post office, in a town which is strange 
to both man and dog; yet he will in no circumstances learn 
to spell. Similarly, a cat who requires many trials to learn to 
unfasten the catch of a puzzle box may learn in one or two 
trials the shortest series of jumps to the attic window which 
is always left open. 

If we wish to gain some understanding of intelligent 
activity we should recognize its complexity and variation 
and adapt our methods to observable phenomena rather 
than attempt to restrict the field by restricting ourselves. We 
shall appraise more accurately if we concentrate less on 
measuring precisely. Contemporary theories contain a 
vital germ of the truth but at present they include both more 
and less than the whole truth about the appraisal of intelli- 